STAPL: Standard Template Adaptive Parallel Library


STAPL: Standard Template Adaptive Parallel Library
Lawrence Rauchwerger
Antal Buss, Harshvardhan, Ioannis Papadopoulos, Olga Pearce, Timmie Smith, Gabriel Tanase, Nathan Thomas, Xiabing Xu, Mauro Bianco, Nancy M. Amato
http://parasol.tamu.edu/stapl
Dept. of Computer Science and Engineering, Texas A&M University

Motivation
- Multicore systems are ubiquitous
- Problem complexity and size are increasing; dynamic programs are even harder
- Programmability needs to improve
- Portable performance is lacking: parallel programs are not portable, and scalability & efficiency are (usually) poor

STAPL: Standard Template Adaptive Parallel Library
A library of parallel components that adopts the generic programming philosophy of the C++ Standard Template Library (STL).
- Application development components: palgorithms, pcontainers, Views, prange; these provide a Shared Object View that eliminates explicit communication in the application
- Portability and optimization: the Runtime System (RTS) with its Adaptive Remote Method Invocation (ARMI) communication library, and the Framework for Algorithm Selection and Tuning (FAST)

Three STAPL Developer Levels
- Application Developer: writes application code; uses pcontainers and palgorithms
- Library Developer: writes new pcontainers and palgorithms; uses prange and the RTS
- Run-time System Developer: ports the system to new architectures and writes task scheduling modules; uses native threading and communication libraries
The software stack, top to bottom: user application code; palgorithms and Views; pcontainers and prange; the run-time system with the ARMI communication library, scheduler, executor, and performance monitor; and the native layer (Pthreads, OpenMP, MPI, ...).

Applications Using STAPL
- Particle Transport - PDT
- Bioinformatics - Protein Folding
- Geophysics - Seismic Ray Tracing
- Aerospace - MHD: seq. Ctran code (7K LOC), STL (.K LOC), STAPL (.3K LOC)

pcontainers: Parallel Containers
Container - a data structure with an interface to maintain and access a collection of generic elements: STL (vector, list, map, set, hash), MTL [1] (matrix), BGL [2] (graph), etc.
pcontainer - distributed storage and concurrent methods:
- Shared Object View
- Compatible with its sequential counterpart (e.g., STL)
- Thread safe
- Support for user customization (e.g., data distributions)
Currently implemented: parray, pvector, plist, pgraph, pmatrix, passociative
[1] Matrix Template Library
[2] Boost Graph Library

pcontainer Framework
Concepts and methodology for developing parallel containers. A pcontainer is a collection of base containers plus information for parallelism management.
Improved user productivity:
- Base classes providing fundamental functionality
- Inheritance and specialization
- Composition of existing pcontainers
Scalable performance:
- Distributed, non-replicated data storage
- Parallel (semi-random) access to data
- Low overhead relative to the base container counterpart

pcontainer Framework Concepts
- Base Container: data storage; sequential containers (e.g., STL, MTL, BGL) or parallel containers (e.g., Intel TBB)
- Data Distribution Information, providing the shared object view: Global Identifier, Domain, Partition, Location, Partition Mapper
In the slide's figure, a p_array pa(6) holding elements a-f appears to the user as a single array indexed 0-5, while the data distribution information maps the elements into base containers stored on Location 0 and Location 1.

pcontainer Interfaces
- Constructors: default constructors; may specify a desired data distribution
- Concurrent methods: sync, async, split phase
- Views

  stapl_main() {
    partition_block_cyclic partition(0); // argument is block size
    p_matrix<int> data(00, 00, partition);
    p_generate(data.view(), rand());
    res = p_accumulate(data.view(), 0);
  }

pgraph Methods
Performance of add vertex and add edge, called asynchronously. Weak scaling on a CRAY XT4 using up to 4000 cores and on a Power 5 cluster using up to 8 cores; torus with 500x500 vertices per processor.

pgraph Algorithms
Performance of find_sources and find_sinks in a directed graph. Weak scaling on a CRAY XT4 using up to 4000 cores and on a Power 5 cluster using up to 8 cores; torus with 500x500 vertices per processor.

Views
A View defines an abstract data type that provides methods for access and traversal of the elements of a pcontainer, independent of how the elements are stored in the pcontainer.
Example: printing the elements of a 3x3 matrix holding the values 1-9.

  print(view v)
    for i = 1 to v.size() do
      print(v[i])

Rows view output: 1,2,3,4,5,6,7,8,9
Columns view output: 1,4,7,2,5,8,3,6,9

palgorithms
palgorithms build and execute task graphs to perform computation. Task graphs in STAPL are called pranges.
Easy to develop:
- Work functions look like sequential code
- Work functions can call STAPL palgorithms
- prange factories simplify task graph construction
STAPL palgorithms accelerate application development:
- Basic building blocks for applications
- Parallel equivalents of STL algorithms
- Parallel algorithms for pcontainers
- Graph algorithms for pgraphs
- Numeric algorithms/operations for pmatrices

Parallel Find
Find the first element equal to the given value. The map operation runs a find on each piece of the view; the reduce operation keeps the earliest match.

  View::iterator p_find(view view, T value)
    return map_reduce(view, find_work_function(value), std::less())

  View::iterator find_work_function(view view)
    if (do_nesting())
      return p_find(view, value)
    else
      return std::find(view.begin(), view.end(), value)
    endif
  end

Parallel Sample Sort
A palgorithm written as a sequence of task graphs.

  p_sort(view view, Op comparator)
    // handle recursive call
    if (view.size() <= get_num_locations())
      reduce(view, merge_sort_work_function(comparator));

    sample_view = map(view, select_samples_work_function());

    // sort the samples
    p_sort(sample_view, comparator);

    // partition the data using the samples
    partitioned_view = map(view, full_overlap_view(sample_view),
                           bucket_partition_work_function(comparator));

    // sort each partition
    map(partitioned_view, sort_work_function(comparator));

Scalability of palgorithms Execution times for weak scaling of palgorithms on data stored in different pcontainers on CRAY XT4.

STAPL Runtime System
The RTS sits between the "smart application" (with its application-specific parameters) and the operating system kernel.
- Advanced stage: ARMI, executor, memory manager
- Experimental stage: multithreading (communication thread, RMI thread, task thread) and custom scheduling via a user-level dispatcher
- Kernel scheduling: the OS kernel scheduler (no custom scheduling, e.g., NPTL)

The STAPL Runtime System (RTS)...
- Abstracts platform resources: threads, mutexes, atomics
- Provides a consistent API and behavior across platforms
- Configured at compile time for a specific platform: hardware counters, different interconnect characteristics
- Adapts at runtime to the execution environment: available memory, communication intensity, etc.
- Provides an interface for calling functions on distributed objects: ARMI, Adaptive Remote Method Invocation
One instance of the RTS runs in every process, so the RTS itself is distributed.

ARMI: Adaptive Remote Method Invocation
The communication service of the RTS. It provides two degrees of freedom: it allows transfer of data, work, or both across the system, and it is used both to hide latency and to call a function on a distributed object anywhere on the system. It supports mixed-mode operation (MPI + threads).
Primitives:
- Asynchronous: async_rmi, opaque_rmi
- Synchronous: sync_rmi
- One-to-many: multi_async_rmi, multi_opaque_rmi, multi_sync_rmi
- Collective: broadcast_rmi, allreduce_rmi, reduce_rmi
- Synchronization: rmi_fence, rmi_barrier, rmi_fence_os, rmi_flush

  // Example of ARMI use
  async_rmi(destination, p_object, function, arg0, ...);
  r = sync_rmi(destination, p_object, function, arg0, ...);

The STAPL RTS: Major Components
- Resource reuse: thread pool, memory manager
- Hardware monitoring: hardware counters and system resource usage; interfaces with TAU and OS-provided system calls
- ARMI adaptivity: feedback to pcontainers / prange; resource acquisition before it is needed; load monitoring and thread allocation; detection of shared memory vs. MPI; different communication methods; RMI request aggregation and combining
- Scheduler: execution of RMIs; consistency models
- Executor: execution of tasks provided by the prange; multithreaded support

FAST Architecture
A framework for parallel algorithm selection:
- The developer specifies the important parameters and the metrics to use
- The developer queries the model produced to select an implementation
- The selection process is transparent to code calling the algorithm
Flow: STAPL user code and the available parallel algorithm choices, together with data characteristics and runtime tests, feed a model that produces an adaptive executable with the selected algorithm.

Parallel Sorting: Experimental Results
Attributes for the selection model: processor count, data type, input size, max value (impacts radix sort), presortedness.
On the SGI Altix, the selection model achieves 100% accuracy on the validation set (V).
[Chart: normalized execution times of radix, sample, and column sort versus the adaptive choice, for nearly-sorted, reversed, and random integer and double inputs across processor counts; the adaptive performance penalty is small.]

Parallel Sorting - Altix Relative Performance (V)
[Chart: relative speedup of sample, column, radix, and adaptive sorts against the best algorithm across processor counts, random input.]
The model obtains 99.7% of the possible performance; the next best algorithm (sample) provides only 90.4%.

PDT: Developing Applications with STAPL
- An important application class for DOE (e.g., Sweep3D and UMTK)
- Large, on-going DOE project at TAMU to develop the application in STAPL
- A STAPL precursor is used by PDT in the DOE PSAAP center
The figures show one sweep and eight simultaneous sweeps.

pranges in PDT: Writing new palgorithms
- pranges are sweeps in the particle transport application (the task graphs pra and prb in the slide's figure)
- Reflective materials on the problem boundary create dependencies between sweeps
- zip(pra, prb, Zipper(4,3,4));
- The composition operator will allow easy composition

Sweep Performance
- Weak scaling keeps the number of unknowns per processor constant; communication increases with processor count
- The KBA model shows the performance of a perfectly scheduled sweep
- Divergence after 2048 processors is due to non-optimal task scheduling

Conclusion
STAPL allows productive parallel application development.
- pcontainers and palgorithms are application building blocks that simplify development
- Extensibility enables easy development of new components
- Composition of pcontainers and palgorithms enables reuse
- The RTS and FAST provide portability and adaptability