Enhancing the performance of Grid Applications with Skeletons and Process Algebras


Enhancing the performance of Grid Applications with Skeletons and Process Algebras
(funded by the EPSRC, grant number GR/S21717/01)
A. Benoit, M. Cole, S. Gilmore, J. Hillston
http://groups.inf.ed.ac.uk/enhance/  abenoit1@inf.ed.ac.uk
Seminar in Pisa, 7th December 2004

Introduction - Context of the work
- Parallel programs in a heterogeneous context:
  - run on a widely distributed collection of computers
  - resource availability and performance are unpredictable
  - scheduling/rescheduling issues
- High-level parallel programming:
  - library of skeletons (parallel schemes)
  - many real applications can use these skeletons
  - modularity, configurability: easier for the user
- The Edinburgh Skeleton Library eskel (MPI) [Cole02]

Introduction - Performance evaluation
- Use of a particular skeleton: information about implied scheduling dependencies
- Model with a stochastic process algebra:
  - include aspects of uncertainty
  - automated modelling process
- Dynamic monitoring of resource performance
- Allows better scheduling decisions and adaptive rescheduling of applications
- Goal: enhance the performance of parallel programs

Structure of the talk
- The Edinburgh Skeleton Library eskel, and a comparison with the P3L concepts
  - Motivation and general concepts
  - Skeletons in eskel
  - Using eskel
- Performance models of skeletons
  - Pipeline model
  - AMoGeT (Automatic Model Generation Tool)
  - Some results
- Conclusions and Perspectives

eskel - Brief history
- Concept of skeletons widely motivated
- eskel: Murray Cole, 2002
  - Library of C functions, on top of MPI
  - Addresses issues raised by skeletal programming
- eskel-2: Murray Cole and Anne Benoit, 2004
  - New interface and implementation
  - More concepts addressed, for more flexibility

eskel - Fundamental concepts
- Nesting mode: defines how we can nest several skeletons together
- Interaction mode: defines the interaction between different parts of skeletons, and between skeletons
- Data mode: related to the other two concepts, defines how the data are handled
- How do we address these issues in eskel? How are they addressed in P3L?

eskel - Nesting Mode
- Can be either transient or persistent
- Transient nesting: an activity invokes another skeleton; the nested skeleton carries or creates its own data
- Persistent nesting: the nested skeleton is invoked once and gets its data from the outer-level skeleton
- Linked to the data mode (detailed later)

eskel - Nesting Mode: call tree
- Call tree built at the first interaction of each activity
- Structure of the persistently nested skeletons: search in the tree to find interaction partners
- Transiently nested skeletons are not in the main tree: created dynamically, with a limited lifetime; a subtree is built dynamically when they are invoked

eskel - Nesting Mode in P3L
- P3L (Anacleto, SkIE): all nesting of skeletons is persistent
  - Defined within the P3L layer
  - Clearly separated from the sequential code defining the activities
- P3L-based libraries (Lithium, SKElib): the concept of transient nesting is not explicitly addressed; not forbidden, but not supported
- ASSIST: not relevant

eskel - Interaction Mode
- Can be either implicit or explicit
- Implicit: an activity has no control over its interactions; it is a function taking input data and returning output data
- Explicit: interactions are triggered in the activity code by direct calls to the generic functions Take and Give
- Additional devolved mode for nested skeletons: the outer-level skeleton may use the interaction mode of the inner skeleton

eskel - Interaction Mode in P3L
- P3L and related libraries: interaction via streams of data, implicitly defined by the skeleton
- ASSIST: more flexibility; implicit or explicit interaction is possible

eskel - Data Mode
- Related to the previous concepts
- Buffer mode / Stream mode:
  - BUF: data in a buffer (transient nesting)
  - STRM: the data flow into the skeleton from the activities of some enclosing skeleton call (persistent nesting)
- eskel Data Model (edm): defines the semantics of the interactions; the unit of transfer is the edm molecule

eskel - eskel Data Model (edm)
- edm molecule: collection of edm atoms
- Type: defined using standard MPI datatypes
- edm atom: local versus global spread (across processes P0, P1, P2)
  - Local spread: 3 distinct items (one per process)
  - Global spread: 1 item (spread across the processes)
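To make the local/global distinction concrete, here is a small plain-MPI sketch (deliberately using no eskel calls, so nothing in it depends on the actual eskel types): with three processes each holding a buffer of four doubles, local spread treats the three buffers as three separate items, while global spread treats them as slices of a single item whose length is the sum of the local lengths.

/* Plain-MPI illustration of the two edm spreads described above.
 * No eskel calls are used; "local" and "global" here are just labels. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process holds a buffer of 4 doubles. */
    double part[4] = { rank, rank + 0.1, rank + 0.2, rank + 0.3 };
    int local_len = 4;

    /* Local spread: each process's buffer is a distinct atom, so with
     * `size` processes the molecule carries `size` independent items. */
    printf("rank %d: local spread -> item %d of %d, length %d (first value %.1f)\n",
           rank, rank, size, local_len, part[0]);

    /* Global spread: the per-process buffers are slices of ONE logical
     * item, whose total length is the sum of the local lengths. */
    int global_len = 0;
    MPI_Allreduce(&local_len, &global_len, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: global spread -> 1 item of total length %d\n",
           rank, global_len);

    MPI_Finalize();
    return 0;
}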

eskel - Skeletons: brief description
- Pipeline & Farm: classical skeletons, defined in a very generic way
- Deal: similar to Farm, except that the tasks are distributed in a cyclic order
- HaloSwap: 1-D array of single-process activities, repeatedly (1) exchanging data with immediate neighbours, (2) processing data locally, (3) deciding collectively whether to proceed with another iteration
- Butterfly: class of divide & conquer algorithms

eskel - Skeletons: task/data parallel
- Skeletons are commonly classified as:
  - task parallel: dynamically communicating processes that distribute the work (pipeline, farm)
  - data parallel: work on a distributed data structure (map, fold)
  - control skeletons: sequential modules and iteration of skeletons (seq, loop)
- eskel only requires task parallel skeletons:
  - data parallel skeletons: handled through the edm
  - control: expressed directly through the C/MPI code

eskel - Skeletons: interface
- eskel: not meant to be easy; based on MPI, so the user must be familiar with it; structures parallel MPI code
- P3L: much easier to use, simple structure; less flexibility; structures sequential code; data/task parallel and control skeletons
- Running example: a 3-stage pipeline (create data, process, collect output)

eskel - Interface: Pipeline

void Pipeline(int ns, Amode_t amode[],
              eskel_molecule_t *(*stages[])(eskel_molecule_t *),
              int col, Dmode_t dmode, spread_t spr[], MPI_Datatype ty[],
              void *in, int inlen, int inmul,
              void *out, int outlen, int *outmul, int outbuffsz,
              MPI_Comm comm);

- General information about the pipeline (ns, the stage functions, col, ...)
- The various modes: interaction mode (amode); data mode (dmode), spread (spr) and type (ty)
- Information relative to the input buffer (in, inlen, inmul)
- Information relative to the output buffer (out, outlen, outmul, outbuffsz)
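To show how the pieces of this signature fit together, here is a hedged sketch of a two-stage pipeline call. The header name, the constants IMPL and SPGLOBAL, and the exact meaning of the item-length/multiplicity arguments are assumptions made for illustration (only BUF and the parameter list itself come from these slides), so treat it as the shape of a call rather than verified eskel 2.0 code.

/* Hedged sketch of a two-stage eskel Pipeline call, written against the
 * signature shown on this slide. "eskel.h", the constants IMPL and
 * SPGLOBAL, and the meaning of the length/multiplicity arguments are
 * assumptions for illustration. */
#include <stdio.h>
#include <mpi.h>
#include "eskel.h"                     /* assumed header name */

#define NITEMS 100

/* Implicit-mode stages: take one molecule, return one molecule.
 * Both stages simply forward their input to keep the sketch short. */
eskel_molecule_t *stage1(eskel_molecule_t *m) { return m; }
eskel_molecule_t *stage2(eskel_molecule_t *m) { return m; }

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int ns = 2;                                           /* number of stages */
    eskel_molecule_t *(*stages[2])(eskel_molecule_t *) = { stage1, stage2 };
    Amode_t amode[2]   = { IMPL, IMPL };                  /* implicit mode    */
    spread_t spr[2]    = { SPGLOBAL, SPGLOBAL };          /* global spread    */
    MPI_Datatype ty[2] = { MPI_INT, MPI_INT };            /* per-stage type   */

    int in[NITEMS], out[NITEMS], outmul;
    for (int i = 0; i < NITEMS; i++) in[i] = i;           /* toy input buffer */

    int col = (rank < size / 2) ? 0 : 1;                  /* which stage this
                                                             process works for */

    /* Buffer data mode (BUF): input taken from `in`, results left in `out`. */
    Pipeline(ns, amode, stages, col, BUF, spr, ty,
             in, 1, NITEMS,               /* input: item length, multiplicity */
             out, 1, &outmul, NITEMS,     /* output buffer and its capacity   */
             MPI_COMM_WORLD);

    if (rank == 0)
        printf("pipeline produced %d output item(s)\n", outmul);

    MPI_Finalize();
    return 0;
}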

eskel - Interface: Deal

void Deal(int nw, Amode_t amode,
          eskel_molecule_t *worker(eskel_molecule_t *),
          int col, Dmode_t dmode,
          void *in, int inlen, int inmul, spread_t inspr, MPI_Datatype inty,
          void *out, int outlen, int *outmul, spread_t outspr, MPI_Datatype outty,
          int outbuffsz, MPI_Comm comm);

- General information about the deal (nw, the worker function, col, ...)
- The various modes: interaction mode (amode) and data mode (dmode)
- Information relative to the input buffer (with its own spread and type)
- Information relative to the output buffer (with its own spread and type)

eskel - Use of the library
- A C/MPI program calling the skeleton functions
- Great care must be taken with the parameters
- Definition of nested skeletons, workers, ... through standard C/MPI functions
- Only Pipeline and Deal are implemented so far in eskel version 2.0
- Demonstration of the use of eskel

Structure of the talk
- The Edinburgh Skeleton Library eskel
  - Motivation and general concepts
  - Skeletons in eskel
  - Using eskel
- Performance models of skeletons
  - Pipeline model
  - AMoGeT (Automatic Model Generation Tool)
  - Some results
- Conclusions and Perspectives

Pipeline - Principle of the skeleton

  inputs -> Stage 1 -> Stage 2 -> ... -> Stage Ns -> outputs

- The stages process a sequence of inputs to produce a sequence of outputs
- All inputs pass through each stage in the same order
- The internal activity of a stage may be parallel, but this is transparent to our model
- Model: mapping of the application onto the computing resources, i.e. the network and the processors

Pipeline - Application model
- Application model: independent of the resources
- One PEPA component per stage of the pipeline
- Sequential component: gets data (move_i), processes it (process_i), moves the data to the next stage (move_{i+1}):

  Stage_i = (move_i, ⊤).(process_i, ⊤).(move_{i+1}, ⊤).Stage_i

- Unspecified rates (⊤): determined by the resources
- Pipeline application = cooperation of the stages:

  Pipeline = Stage_1 <move_2> Stage_2 <move_3> ... <move_Ns> Stage_Ns

- Boundary: move_1 is the arrival of an input into the application; move_{Ns+1} is the transfer of the final output out of the Pipeline

Pipeline - Network model
- Network model: information about the efficiency of the link connections between pairs of processors
- Assign rates λ_i to the move_i activities:

  Network = (move_1, λ_1).Network + ... + (move_{Ns+1}, λ_{Ns+1}).Network

- λ_i represents the connection between the processor hosting stage i-1 and the processor hosting stage i
- Boundary cases: "stage 0" stands for the processor providing inputs to the Pipeline, and "stage Ns+1" for the place where we want the outputs to be delivered

Pipeline - Processors model
- Processors model: the application is mapped onto a set of Np processors
- Rate μ_i of the process_i activities: load of the hosting processor, and other performance information
- One stage per processor:

  Processor_p = (process_i, μ_i).Processor_p

- Several stages per processor (choice between their activities):

  Processor_p = (process_i, μ_i).Processor_p + (process_j, μ_j).Processor_p + ...

- Set of processors: parallel composition

  Processors = Processor_1 || Processor_2 || ... || Processor_Np

Pipeline - Overall model
- The overall model is the mapping of the stages onto the processors and the network, obtained with the cooperation combinator:

  Mapping = Network <Lm> (Pipeline <Lp> Processors)

- The set Lm = {move_1, ..., move_{Ns+1}} synchronises the Pipeline with the Network
- The set Lp = {process_1, ..., process_Ns} synchronises the Pipeline with the Processors

AMoGeT - Overview
AMoGeT: the Automatic Model Generation Tool

  description files + performance information from NWS
    -> [Generate models] -> PEPA models
    -> [PEPA Workbench]  -> results
    -> [Compare results]

- Generic analysis component
- Ultimate role: integrated component of a run-time scheduler and re-scheduler

AMoGeT - Description files (1)
- Specify the names of the processors in the file hosts.txt: a list of IP addresses
- The i-th address in the list corresponds to processor i; processor 1 is the reference processor

  wellogy.inf.ed.ac.uk
  bw240n01.inf.ed.ac.uk
  bw240n02.inf.ed.ac.uk
  france.imag.fr

AMoGeT - Description files (2)
- Describe the modelled application mymodel in the file mymodel.des
- Stages of the Pipeline: the number of stages, and the time tr_i (in sec) required to compute one output for stage i on the reference processor:

  nbstage=3;
  tr1=10; tr2=2; ...

- Mappings of stages to processors: the location of the input data, the processor where each stage is processed, and where the output data must be left:

  mappings=[1,(1,2,3),1],[1,(1,1,1),1];

AMoGeT - Using the Network Weather Service
- The Network Weather Service (NWS) [Wolski99]: dynamic forecast of the performance of network and computational resources
- Just a few scripts to run on the monitored nodes
- Information we use:
  - fraction of CPU available to a newly-started process on processor p
  - latency (in ms) of a communication from processor p to processor q
  - frequency of each processor in MHz (from /proc/cpuinfo)

AMoGeT - Generating the models
- One Pipeline model per mapping
- Problem: computing the rates
- Rate μ_i of process_i, when stage i is hosted on processor p (with a total of np stages hosted on this processor): derived from the fraction of CPU available on p, the frequency of p relative to the reference processor, the reference time tr_i, and np
- Rate λ_i of move_i: corresponds to the connection link between the processor hosting stage i-1 and the processor hosting stage i (boundary cases: stage 0 = input, stage Ns+1 = output); derived from the measured latency
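A minimal sketch of how such rates could be computed from the NWS measurements and the description-file values listed above; the slide does not give the exact formulas, so the scaling below (reference time adjusted by relative frequency, CPU share and stage count, with the rate taken as the reciprocal of the resulting duration) is an assumption.

/* Hedged sketch: turning NWS measurements and description-file values
 * into PEPA rates for one mapping. The exact formulas used by AMoGeT
 * are not given on the slide; the scaling below is an assumption. */
#include <stdio.h>

/* Rate of process_i when stage i runs on processor p.
 * tr_i     : time (sec) of stage i on the reference processor
 * cpu_p    : fraction of CPU available on p (NWS), in (0, 1]
 * freq_p   : frequency of p in MHz (/proc/cpuinfo)
 * freq_ref : frequency of the reference processor in MHz
 * np       : number of stages mapped onto p (they share the CPU)    */
static double process_rate(double tr_i, double cpu_p,
                           double freq_p, double freq_ref, int np) {
    double time_on_p = tr_i * (freq_ref / freq_p) * np / cpu_p;  /* assumed */
    return 1.0 / time_on_p;        /* PEPA rate = 1 / expected duration    */
}

/* Rate of move_i for the link between the processor hosting stage i-1
 * and the processor hosting stage i.
 * latency_ms : NWS latency in milliseconds; <= 0 means both stages
 *              are on the same host.                                 */
static double move_rate(double latency_ms) {
    if (latency_ms <= 0.0)
        return 1.0e6;              /* effectively immediate local transfer */
    return 1000.0 / latency_ms;    /* convert ms to a per-second rate      */
}

int main(void) {
    /* Toy values, in the spirit of the tr1=10, tr2=2 example above. */
    double mu1    = process_rate(10.0, 0.9, 2800.0, 2000.0, 1);
    double lambda = move_rate(0.5);
    printf("mu_1 = %g, lambda = %g\n", mu1, lambda);
    return 0;
}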

AMoGeT - Solving the models and comparing the results
- Numerical results obtained with the PEPA Workbench [Gilmore94]
- Performance result: the throughput of the move activities = throughput of the application
- Results obtained via a simple command line; all the results are saved in a single file
- Which mapping produces the best throughput? Use this mapping to run the application
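The final step is a straightforward comparison. As a rough sketch of what it amounts to, the following loop picks the mapping with the highest throughput from a results file; the file name and its one-pair-per-line format are assumptions for illustration, not the actual AMoGeT output format.

/* Hedged sketch of the comparison step: read one "mapping throughput"
 * pair per line and keep the mapping with the highest throughput.
 * The file name and its format are assumptions, not the actual
 * AMoGeT results format. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("results.txt", "r");   /* hypothetical results file */
    if (!f) { perror("results.txt"); return 1; }

    char mapping[128], best[128] = "";
    double tp, best_tp = -1.0;

    while (fscanf(f, "%127s %lf", mapping, &tp) == 2) {
        if (tp > best_tp) {                /* keep the best mapping so far */
            best_tp = tp;
            snprintf(best, sizeof best, "%s", mapping);
        }
    }
    fclose(f);

    printf("best mapping: %s (throughput %g)\n", best, best_tp);
    return 0;
}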

Numerical Results - Example 1: 3 Pipeline stages, up to 3 processors
- 27 states, 51 transitions; less than 1 second to solve
- The communication latency is fixed; all stages and processors are identical; the time required to complete a stage depends on the number of stages hosted on the processor (n_i)
- Mappings compared: all the mappings with the first stage on the first processor
- First setting: the optimal mappings are (1,2,3) and (1,3,2), with a throughput of 5.64
- Second setting: the same mappings are optimal (one stage on each processor), but the throughput is divided by 2 (2.82)

Numerical Results - Example 2: 4 Pipeline stages, up to 4 processors
- Mappings compared: [1,(1,1,1,1),1], [1,(1,1,2,2),1], [1,(1,1,2,3),1], [1,(1,2,3,4),1]
- [Plot: throughput (0 to 0.06) of each mapping as a function of the communication latency (0 to 50 seconds)]

Structure of the talk
- The Edinburgh Skeleton Library eskel
  - Motivation and general concepts
  - Skeletons in eskel
  - Using eskel
- Performance models of skeletons
  - Pipeline model
  - AMoGeT (Automatic Model Generation Tool)
  - Some results
- Conclusions and Perspectives

Conclusions - Part 1
- Why structured parallel programming matters (Murray Cole's invited talk at Euro-Par 2004 in Pisa)
- Presentation of the Edinburgh Skeleton Library eskel:
  - the concepts at the basis of the library
  - how we address these concepts
- Comparison with the P3L language and concepts, designed in Pisa

Perspectives - Part 1 (1)
eskel: ongoing development and implementation phase
- Several skeletons still to implement
- The interface could be made a bit easier and more user-friendly
- Need for a debugging mode to help with writing applications (checking the correctness of the definitions in the application code, the coherence between modes, ...)
- Demo for interested people

Perspectives - Part 1 (2)
- Validation of these concepts: develop a real application with eskel
- Promote the idea of skeletons
- Comparison with other approaches:
  - P3L, ASSIST, ...
  - Kuchen's skeleton library
  - the parallel functional language Eden
- Motivation for my visit to Pisa

Conclusions - Part 2
- Use of skeletons and performance models to improve the performance of high-level parallel programs
- Pipeline and Deal skeletons
- The AMoGeT tool automates all the steps needed to obtain the results easily
- Models help us choose the mapping that produces the best throughput for the application
- Use of the Network Weather Service to obtain realistic models

Perspectives - Part 2
- Provide more detailed timing information on the tool to prove its usefulness (recent work)
- Extension to other skeletons
- Experiments with a realistic application on a heterogeneous computational Grid
- Integration in a graphical tool to help the design of applications with eskel
- This first case study shows that we have the potential to enhance the performance of high-level parallel programs with the use of skeletons and process algebras

Thank you for your attention! Grazie per la vostra attenzione! Any questions?

Related projects
- The Network Weather Service [Wolski99]: benchmarking and monitoring techniques for the Grid; no skeletons and no performance models
- ICENI project [Furmento02]: performance models to improve the scheduling decisions; no skeletons; models = graphs which approximate data
- Use of skeleton programs within grid nodes [Alt02]: each server provides a function capturing the cost of its implementation of each skeleton; each skeleton runs only on one server; scheduling = selecting the most appropriate servers