Parallel and Distributed Computing

NUMA; OpenCL; MapReduce

José Monteiro
MSc in Information Systems and Computer Engineering
DEA in Computational Engineering
Department of Computer Science and Engineering (DEI)
Instituto Superior Técnico
November 26, 2008

Outline
- Cache-Coherent NUMA
- AMD Opteron
- IBM Cell Broadband Engine
- Programming NUMA systems
- OpenCL
- MapReduce

Shared-Memory Systems
- also known as Uniform Memory Access (UMA) architecture
- Symmetric Shared-Memory Multiprocessors (SMP)
[Figure: four processors (P) connected to a single shared Main Memory and I/O.]

Distributed-Memory Systems
- or Non-Uniform Memory Access (NUMA) architecture
- Multicomputers
[Figure: four nodes, each with a processor (P), cache, main memory and I/O, connected by an interconnection network.]

Cache-Coherent NUMA
- Limitations of UMA / SMP: limited scalability (8 to 12 cores), due to contention in accessing memory.
- Limitations of multicomputers: high communication overhead.
- Intermediate solution: Cache-Coherent NUMA, or ccNUMA (also known as Distributed Shared Memory, DSM).
- Examples: IBM Cell, AMD Opteron.

Cache-Coherent NUMA
[Figure: four nodes, each with a processor (P), cache, main memory and I/O, connected by an interconnection network.]

Cache-Coherent NUMA
[Figure: four processors (P), each with its own cache and main memory, and I/O.]
- highly scalable
- memory bandwidth grows with computational power
- cache coherence possible due to shared global bus

AMD Opteron
- each AMD Opteron chip has its own memory controller, allowing for easy system extension
- each node may be single- or multi-core
- each node has L1 and L2 caches

IBM Cell Broadband Engine
Heterogeneous multiprocessor:
- Power Processing Element (PPE): master processor
- 8 Synergistic Processing Elements (SPE): fully functional RISC processors
- local storage size per SPE: 256 KB
- SPEs can only access their own local memory

NUMA Aware Systems
For optimal performance on NUMA systems:
- processes should be located on processors that are as close as possible to the memory they access
- allocate all memory for a process on the same processor
- the OS should use a multi-queue scheduler, with one runqueue per processor
- dispatch all child processes on the same processor throughout the life of the parent process
Linux and Windows are NUMA ready.

NUMA Aware Systems
On Linux, numactl defines the scheduling and/or memory placement policy:

    numactl --interleave=all bigdatabase
        run bigdatabase with its memory interleaved across all nodes

    numactl --cpubind=0 --membind=0,1 process
        run process on node 0, with memory allocated on nodes 0 and 1

    numactl --show
        show the NUMA state
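
To make the difference between interleaving and binding concrete, here is a small Python sketch that only simulates where pages would land. It does not call numactl or the kernel. The function names `interleave` and `membind`, the 4-node machine and the free-page counts are assumptions made for illustration, and the fill-in-order behaviour of `membind` is a simplification of what a real kernel does.

```python
from collections import Counter
from itertools import cycle


def interleave(num_pages, nodes):
    """Simulate an interleave policy: pages go round-robin over `nodes`."""
    node_cycle = cycle(nodes)
    return [next(node_cycle) for _ in range(num_pages)]


def membind(num_pages, nodes, free_pages):
    """Simulate a bind policy: use only the allowed nodes, never spill elsewhere.

    `free_pages` maps node -> number of free pages (hypothetical values).
    Raises MemoryError when the allowed nodes cannot hold the request.
    """
    placement = []
    remaining = dict(free_pages)
    for node in nodes:
        while remaining.get(node, 0) > 0 and len(placement) < num_pages:
            placement.append(node)
            remaining[node] -= 1
    if len(placement) < num_pages:
        raise MemoryError("allowed nodes are full")
    return placement


if __name__ == "__main__":
    all_nodes = [0, 1, 2, 3]
    # Interleaving spreads pages evenly over every node.
    print(Counter(interleave(8, all_nodes)))
    # Binding to nodes 0 and 1 leaves nodes 2 and 3 untouched.
    print(Counter(membind(6, [0, 1], {0: 4, 1: 4, 2: 4, 3: 4})))
```

Interleaving trades locality for balanced memory bandwidth, which is why the slide's example applies it to a large database; binding keeps data close to the processor that uses it.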

Programming NUMA Systems
Linux provides libnuma, a library with a simple programming interface for NUMA systems.
- include the header: #include <numa.h>
- link with the library: gcc ... -lnuma
Defines policies for:
- thread binding
- memory allocation
Before any other routine is used, int numa_available() must be called. If it returns -1, all other functions in this library are undefined.

Programming NUMA Systems
Querying the system:
- int numa_max_node()
  returns the highest node number in the system; nodes are numbered from 0
- long numa_node_size(int node, long *freep)
  returns the memory size of node node, and stores the current free memory in freep
- int numa_distance(int node1, int node2)
  reports the distance in the machine topology between two nodes. The factors are a multiple of 10. It returns 0 when the distance cannot be determined. A node has distance 10 to itself.
Thread binding:
- int numa_run_on_node(int node)
  binds the current thread and its children to node node (for a set of nodes, a nodemask can be specified)
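
The distance query is typically used to decide where to place data or threads when the local node is not an option. The Python sketch below illustrates that reasoning on a hypothetical distance table; it does not call libnuma. The name `closest_nodes` and the 4-node `DISTANCE` matrix are assumptions, chosen to follow the conventions stated above (10 for a node to itself, 0 when the distance is unknown).

```python
def closest_nodes(distance, node):
    """Return the other nodes ordered from nearest to farthest from `node`.

    A distance of 0 means "cannot be determined", so those nodes are skipped.
    """
    row = distance[node]
    candidates = [n for n, d in enumerate(row) if n != node and d > 0]
    return sorted(candidates, key=lambda n: row[n])


# Hypothetical topology: DISTANCE[a][b] is the distance from node a to node b.
DISTANCE = [
    [10, 20, 20, 30],
    [20, 10, 30, 20],
    [20, 30, 10, 20],
    [30, 20, 20, 10],
]

if __name__ == "__main__":
    for node in range(len(DISTANCE)):
        print(node, closest_nodes(DISTANCE, node))
```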

Programming NUMA Systems
Memory allocation:
- void *numa_alloc_onnode(size_t size, int node)
  allocates size bytes of memory on a specific node node
- void *numa_alloc_local(size_t size)
  allocates size bytes of memory on the local node
- void *numa_alloc_interleaved(size_t size)
  allocates size bytes of memory, page interleaved on all nodes
- void *numa_alloc(size_t size)
  allocates size bytes of memory with the current NUMA policy
- void numa_free(void *start, size_t size)
  frees size bytes of memory starting at start, allocated by the numa_alloc_* functions above
Node masks:
- define a subset of nodes to which thread binding and memory allocation apply
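
A node mask is essentially a set of node numbers, commonly represented as a bit set. The Python sketch below shows only that idea; libnuma has its own mask types and functions, which are not reproduced here. The helper names `make_nodemask`, `nodes_in` and `is_allowed` are hypothetical.

```python
def make_nodemask(nodes):
    """Build a bit set with one bit per node number in `nodes`."""
    mask = 0
    for node in nodes:
        mask |= 1 << node
    return mask


def nodes_in(mask):
    """List the node numbers whose bit is set in `mask`."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def is_allowed(mask, node):
    """Check whether binding or allocation on `node` is permitted by `mask`."""
    return bool((mask >> node) & 1)


if __name__ == "__main__":
    mask = make_nodemask([0, 2])
    print(bin(mask), nodes_in(mask), is_allowed(mask, 1))
```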

Future Architecture Trends
- CPUs with increased number of SMP cores
  Examples: Intel Core 2 Quad; AMD Bulldozer; Sun SPARC Rock.
- GPUs with increased number of SIMD cores
  Example: NVIDIA GTX 280 / GTX 260
- CPU / GPU convergence
  Examples: Intel Larrabee; AMD / ATI Fusion
Future trend is:
- many simple cores
- each core with vector (SIMD) capabilities (of growing length)

Parallel Programming Models
- OpenMP, PThreads for SMP systems
- libnuma for ccNUMA systems
- CUDA for NVIDIA's GPUs; CTM (Close To Metal) for ATI's GPUs
- Message Passing Interface, MPI
This disparate set of models makes it challenging to target algorithms so that they optimally exploit the available computational power.
Parallel algorithms need to address combinations of these models!

OpenCL
Many issues are common to all parallel programming models!
OpenCL: a cross-platform language recently proposed for data (and task) parallel programming on both GPUs and CPUs.
OpenCL was created by Apple in cooperation with others, and will be an open standard administered by the Khronos Group.

OpenCL
- Enable use of all computational resources in a system
  - program GPUs, CPUs, Cell, DSP and other processors as peers
  - support both data- and task-parallel compute models
- Low-level, high-performance abstraction, but with silicon portability
  - approachable, but primarily targeted at expert developers
  - ecosystem foundation: no middleware or convenience functions
- Efficient C-based parallel programming model
  - familiar language for rapid adoption
- Close integration with OpenGL and other 3D APIs for advanced visualization applications and innovation
- Enable embedded and handheld devices through an embedded profile in the specification
- Drive future hardware requirements
  - e.g. floating-point precision requirements

MapReduce Paradigm
MapReduce:
- a simple programming model, proposed by Google
- motivated by large-scale data processing
- applicable to many computing problems
MapReduce provides:
- automatic parallelization and distribution
- fault tolerance
- I/O scheduling
- status and monitoring

MapReduce Usage
Programmer specifies two functions:
- map(in_key, in_value) -> list(out_key, intermediate_value)
  - processes an input key/value pair
  - produces a set of intermediate pairs
- reduce(out_key, list(intermediate_value)) -> list(out_value)
  - combines all intermediate values for a particular key
  - produces a set of merged output values (usually just one)

Example Map and Reduce Functions

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
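
The word-count pseudocode above can be tried out with a few lines of Python. This is a minimal, sequential sketch: `run_mapreduce` is a hypothetical stand-in for the framework that simply runs map, groups intermediate values by key, and runs reduce, with no parallelism, distribution or fault tolerance. `map_fn` and `reduce_fn` mirror the two functions on the slide, with `yield` playing the role of `EmitIntermediate` and `Emit`.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name, value: document contents."""
    for word in value.split():
        yield word, "1"


def reduce_fn(key, values):
    """key: a word, values: an iterable of counts (as strings)."""
    result = 0
    for v in values:
        result += int(v)
    yield str(result)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential stand-in for the framework: map, group by key, reduce."""
    intermediate = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, inter_value in map_fn(in_key, in_value):
            intermediate[out_key].append(inter_value)
    return {k: list(reduce_fn(k, vs)) for k, vs in sorted(intermediate.items())}


if __name__ == "__main__":
    docs = [("a.txt", "to be or not to be"), ("b.txt", "to do")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
```

Because each call to `map_fn` depends only on its own input pair, and each call to `reduce_fn` only on the values of one key, a real framework is free to run them on different machines.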

MapReduce Execution Overview
[Figure: MapReduce execution overview.]

MapReduce Examples
- Distributed Grep: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
- Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.
- Reverse Web-Link Graph: The map function outputs <target, source> pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list(source)>.
- Inverted Index: The map function parses each document and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
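
Two of these examples, URL access counting and the inverted index, are sketched below in Python. As before, `run_mapreduce` is a hypothetical sequential stand-in for the framework. The input formats are assumptions: each log is given as a list of requested URLs, and each document as a whitespace-separated string. Removing duplicate document IDs in the reducer is a choice made here, not something the slide prescribes.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential stand-in for the framework: map, group by key, reduce."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, inter_value in map_fn(key, value):
            intermediate[out_key].append(inter_value)
    return [reduce_fn(k, vs) for k, vs in sorted(intermediate.items())]


def url_count_map(log_name, requested_urls):
    """Emit <URL, 1> for every request in the log."""
    for url in requested_urls:
        yield url, 1


def url_count_reduce(url, counts):
    """Add together all values for the same URL."""
    return url, sum(counts)


def inverted_index_map(doc_id, text):
    """Emit <word, document ID> for every word in the document."""
    for word in text.split():
        yield word, doc_id


def inverted_index_reduce(word, doc_ids):
    """Sort the document IDs for a word (duplicates dropped by choice)."""
    return word, sorted(set(doc_ids))


if __name__ == "__main__":
    logs = [("log1", ["/a", "/b", "/a"]), ("log2", ["/a"])]
    print(run_mapreduce(logs, url_count_map, url_count_reduce))
    docs = [("d1", "a b"), ("d2", "b c b")]
    print(run_mapreduce(docs, inverted_index_map, inverted_index_reduce))
```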

MapReduce Fault Tolerance
On worker failure:
- detect failure via periodic heartbeats
- re-execute completed and in-progress map tasks
- re-execute in-progress reduce tasks
- task completion committed through master
Master failure:
- could handle, but don't yet (master failure unlikely)
Robust: lost 1600 of 1800 machines once, but finished fine.
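
The re-execution rules for worker failure can be written down as a small Python sketch. It only models the master's bookkeeping; there are no real heartbeats or workers. The names `schedule` and `tasks_to_reexecute`, the round-robin assignment and the task and worker labels are assumptions for illustration.

```python
def schedule(tasks, workers):
    """Assign tasks round-robin to workers; returns {task: worker}."""
    return {task: workers[i % len(workers)] for i, task in enumerate(tasks)}


def tasks_to_reexecute(map_assignment, reduce_assignment, reduce_done, failed):
    """Apply the slide's rules once the workers in `failed` miss heartbeats.

    Map tasks on a failed worker are redone whether completed or in progress.
    Reduce tasks on a failed worker are redone only if still in progress.
    """
    redo_maps = sorted(t for t, w in map_assignment.items() if w in failed)
    redo_reduces = sorted(
        t for t, w in reduce_assignment.items()
        if w in failed and t not in reduce_done
    )
    return redo_maps, redo_reduces


if __name__ == "__main__":
    workers = ["w0", "w1"]
    maps = schedule(["m0", "m1", "m2", "m3"], workers)
    reduces = schedule(["r0", "r1"], workers)
    failed = {"w1"}
    redo_maps, redo_reduces = tasks_to_reexecute(maps, reduces, set(), failed)
    survivors = [w for w in workers if w not in failed]
    # Lost work is handed to the workers that are still alive.
    print(schedule(redo_maps + redo_reduces, survivors))
```

The asymmetry between map and reduce tasks follows from where results live in Google's implementation: completed map output sits on the failed machine's local disk and is lost with it, while completed reduce output is already stored in the global file system.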

Review
- Cache-Coherent NUMA
- AMD Opteron
- IBM Cell Broadband Engine
- Programming NUMA systems
- OpenCL
- MapReduce

Next Classes
Efficient parallelization of common algorithms:
- sorting
- search
- numerical algorithms
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationTHE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD
THE PROGRAMMER S GUIDE TO THE APU GALAXY Phil Rogers, Corporate Fellow AMD THE OPPORTUNITY WE ARE SEIZING Make the unprecedented processing capability of the APU as accessible to programmers as the CPU
More informationCell Broadband Engine. Spencer Dennis Nicholas Barlow
Cell Broadband Engine Spencer Dennis Nicholas Barlow The Cell Processor Objective: [to bring] supercomputer power to everyday life Bridge the gap between conventional CPU s and high performance GPU s History
More informationModule 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 10: Introduction to Coherence. The Lecture Contains:
The Lecture Contains: Four Organizations Hierarchical Design Cache Coherence Example What Went Wrong? Definitions Ordering Memory op Bus-based SMP s file:///d /...audhary,%20dr.%20sanjeev%20k%20aggrwal%20&%20dr.%20rajat%20moona/multi-core_architecture/lecture10/10_1.htm[6/14/2012
More informationProgramming Models for Multi- Threading. Brian Marshall, Advanced Research Computing
Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationParallel Architectures
Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36
More informationINTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD
INTRODUCTION TO OPENCL TM A Beginner s Tutorial Udeepta Bordoloi AMD IT S A HETEROGENEOUS WORLD Heterogeneous computing The new normal CPU Many CPU s 2, 4, 8, Very many GPU processing elements 100 s Different
More informationWilliam Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers
William Stallings Computer Organization and Architecture 8 th Edition Chapter 18 Multicore Computers Hardware Performance Issues Microprocessors have seen an exponential increase in performance Improved
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 7: Parallel Computing Cho-Jui Hsieh UC Davis May 3, 2018 Outline Multi-core computing, distributed computing Multi-core computing tools
More informationNon-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.
CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says
More informationParallel and Distributed Systems. Hardware Trends. Why Parallel or Distributed Computing? What is a parallel computer?
Parallel and Distributed Systems Instructor: Sandhya Dwarkadas Department of Computer Science University of Rochester What is a parallel computer? A collection of processing elements that communicate and
More informationPCS - Part Two: Multiprocessor Architectures
PCS - Part Two: Multiprocessor Architectures Institute of Computer Engineering University of Lübeck, Germany Baltic Summer School, Tartu 2008 Part 2 - Contents Multiprocessor Systems Symmetrical Multiprocessors
More informationMultiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University
A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor
More informationMultiprocessors - Flynn s Taxonomy (1966)
Multiprocessors - Flynn s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) Conventional uniprocessor Although ILP is exploited Single Program Counter -> Single Instruction stream The
More informationAn Extension of the StarSs Programming Model for Platforms with Multiple GPUs
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento
More informationhigh performance medical reconstruction using stream programming paradigms
high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming
More informationUnified Memory. Notes on GPU Data Transfers. Andreas Herten, Forschungszentrum Jülich, 24 April Member of the Helmholtz Association
Unified Memory Notes on GPU Data Transfers Andreas Herten, Forschungszentrum Jülich, 24 April 2017 Handout Version Overview, Outline Overview Unified Memory enables easy access to GPU development But some
More informationHSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015!
Advanced Topics on Heterogeneous System Architectures HSA foundation! Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationNUMA replicated pagecache for Linux
NUMA replicated pagecache for Linux Nick Piggin SuSE Labs January 27, 2008 0-0 Talk outline I will cover the following areas: Give some NUMA background information Introduce some of Linux s NUMA optimisations
More informationOpenCL TM & OpenMP Offload on Sitara TM AM57x Processors
OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast
More informationAccelerating image registration on GPUs
Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining
More informationCSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA
CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers
More informationThe MapReduce Framework
The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce
More informationParallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce
Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationCOMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES
COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES P(ND) 2-2 2014 Guillaume Colin de Verdière OCTOBER 14TH, 2014 P(ND)^2-2 PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France October 14th, 2014 Abstract:
More informationTrends and Challenges in Multicore Programming
Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores
More informationNUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems
NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems Carl Pearson 1, I-Hsin Chung 2, Zehra Sura 2, Wen-Mei Hwu 1, and Jinjun Xiong 2 1 University of Illinois Urbana-Champaign, Urbana
More informationExploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API
EuroPAR 2016 ROME Workshop Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API Suyang Zhu 1, Sunita Chandrasekaran 2, Peng Sun 1, Barbara Chapman 1, Marcus Winter 3,
More informationOpenCL: History & Future. November 20, 2017
Mitglied der Helmholtz-Gemeinschaft OpenCL: History & Future November 20, 2017 OpenCL Portable Heterogeneous Computing 2 APIs and 2 kernel languages C Platform Layer API OpenCL C and C++ kernel language
More informationMaster Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading
More informationMassively Parallel Architectures
Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger
More informationNUMA Support for Charm++
NUMA Support for Charm++ Christiane Pousa Ribeiro (INRIA) Filippo Gioachin (UIUC) Chao Mei (UIUC) Jean-François Méhaut (INRIA) Gengbin Zheng(UIUC) Laxmikant Kalé (UIUC) Outline Introduction Motivation
More informationTowards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc.
Towards a codelet-based runtime for exascale computing Chris Lauderdale ET International, Inc. What will be covered Slide 2 of 24 Problems & motivation Codelet runtime overview Codelets & complexes Dealing
More informationDistributed Systems. Lecture 4 Othon Michail COMP 212 1/27
Distributed Systems COMP 212 Lecture 4 Othon Michail 1/27 What is a Distributed System? A distributed system is: A collection of independent computers that appears to its users as a single coherent system
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationProfessional Multicore Programming. Design and Implementation for C++ Developers
Professional Multicore Programming Design and Implementation for C++ Developers Cameron Hughes Tracey Hughes WILEY Wiley Publishing, Inc. Introduction xxi Chapter 1: The New Architecture 1 What Is a Multicore?
More information