What is Performance for Internet/Grid Computation?

Goals for Internet/Grid Computation?

Do things you cannot otherwise do because of:
- Lack of capacity: large-scale computations
- Cost: SETI
- Scale/scope of communication: Internet searches
- All of the above

Sequential Computation

Assumptions: uniform access to memory, dedicated resources, completely reliable, static bounded resources
Limitations: processor speed, memory size, I/O bandwidth, storage capacity
Metrics: time to complete, cost per unit result
Design trade-offs: store versus recompute, algorithm complexity

Parallel Computation

Assumptions: uniform processors, dedicated, static, reliable but bounded resources
Limitations: processor speed, memory per processor, number of processors, I/O bandwidth and storage capacity, communication latency, bandwidth and contention
Metrics: speed-up, time to complete, relative efficiency, cost per unit result
Design trade-offs: design paradigm, partitioning strategy, store versus recompute, algorithm choice

Internet Computation

Assumptions: multiple execution hosts, dynamic environment, highly variable communication channels, unreliable resources, essentially unbounded resources
Limitations: access to resources, management of resources, fault-tolerance
Metrics: time to complete, response times, cost per unit result, speed-up, relative efficiency
Design issues: control strategy, heterogeneous resource management, design paradigm, store versus recompute, algorithm choice, fault management, adaptation/reconfiguration strategy

New Issues
1. Discovery of resources
2. Adaptive resource scheduling

Old Issues in Different Guises
1. Scheduling for heterogeneous resources
2. Fault-tolerance
3. Security and malicious behaviors

Fundamental Question

What are the characteristics of applications that will benefit from Internet-scale execution environments?
- Communication is of intrinsic value.
- Can consume arbitrarily large amounts of resources in small units.
- Can be decomposed into many parallel tasks of variable size without unacceptable execution cost.
- Can tolerate all of diversity, change and faults.
- ?????????

Relative Speed/Cost of Computation versus Communication versus Storage

- Computation power seems to be increasing by about a factor of two every two years or less at constant unit cost.
- Storage size seems to be increasing by about a factor of two every two years or less at constant unit cost.
- Communication bandwidth seems to be increasing by about a factor of ???? every ???? years at ???? unit cost.

Communication Performance of Optical Networks

[Figure: Scientific American, January 2001]

Quantifiable Design Issues
1. Scalable formulation paradigms
2. Speedup, performance and efficiency

Internet/Grid applications are intrinsically distributed parallel programs.
- Review the basics of speedup and efficiency for distributed parallel execution.
- Consider new issues arising from dynamic resource sets, fault management, etc.

Speedup for Distributed Parallel Execution
1. Parallelizability (Amdahl's Law)
2. Communication
3. Synchronization
4. Partitioning and load balance (computation and communication)

Amdahl's Law

Assume an infinite number of processors and no communication or wait time. Then Amdahl's Law states that the potential program speedup is determined by the fraction of code (P) which can be parallelized:

    speedup = 1 / (1 - P)

- If none of the code can be parallelized, P = 0 and speedup = 1 (no speedup).
- If all of the code is parallelized, P = 1 and the speedup is infinite (in theory).
- If 50% of the code can be parallelized, the maximum speedup is 2, meaning the code can run at most twice as fast.

Amdahl's Law - Continued

Introducing the number of processors performing the parallel fraction of the work, the relationship can be modeled by:

    speedup = 1 / (P/N + S)

where P = parallel fraction, N = number of processors and S = serial fraction.

There are limits to the scalability of parallelism. For example, at P = 0.50, 0.90 and 0.99 (50%, 90% and 99% of the code is parallelizable):

        N     P = 0.50   P = 0.90   P = 0.99
       10       1.82       5.26       9.17
      100       1.98       9.17      50.25
     1000       1.99       9.91      90.99
    10000       1.99       9.91      99.02
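As a quick worked example of the two formulas above, here is a minimal Python sketch (the function name amdahl_speedup is an illustrative choice, not from the slides) that reproduces the table:

```python
def amdahl_speedup(p, n=None):
    """Amdahl's Law: speedup = 1 / (P/N + S) with S = 1 - P.
    With n=None, take the limit N -> infinity: speedup = 1 / (1 - P)."""
    s = 1.0 - p
    if n is None:
        return float("inf") if s == 0.0 else 1.0 / s
    return 1.0 / (p / n + s)

# Reproduce the table above.
for n in (10, 100, 1000, 10000):
    row = "  ".join(f"{amdahl_speedup(p, n):7.2f}" for p in (0.50, 0.90, 0.99))
    print(f"{n:>6}  {row}")
```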

Speedups are measured against the best sequential algorithm executing on a single processor identical to those used for the parallel execution.

Speedup is also a function of the extra work which must be done when the computation is executed in parallel:
- Communication
- Wait or synchronization time

Comment: For those applications where communication is the goal, speedup is meaningless.

Communication

Assume that the serial fraction of the computation is zero and that all tasks require exactly the same computation time. Then speedup is bounded by the time taken for any communication among the tasks executing in parallel that is not required for sequential, single-memory execution.

Communication time can never be zero. Why?

Comment: For some applications, the task is communication and the overhead is the computation time necessary to implement the communication.

Simple Analytic Models - Comp + Comm + Idle

[Figure: execution behavior of a program]

Role of Communication Time

1. Assume the serial fraction of the computation is zero.
2. Assume that the task is decomposed into a sequence of computations, each followed by sending a message.
3. Assume each processor takes exactly the same time to execute the computation.
4. Assume all processors send the same data in each message for each unit computation and that the transmission time for each message is uniform.
5. Assume that the time taken to send a message is additive to the time taken for a unit computation.

Then:

    Speedup = T_S(comp) / (T_P(comp) + T_P(comm))
    Speedup = T_S / (T_S/N + number of messages x time for one message)

- The serial fraction is never actually zero.
- T_P(comp) is almost never completely uniform.
- Message transmission time is never quite uniform.
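A minimal sketch of this idealized model in Python, assuming illustrative parameter names (t_seq, n_messages, t_message are not from the slides):

```python
def speedup_with_comm(t_seq, n_procs, n_messages, t_message):
    """Speedup = T_S / (T_S/N + number_of_messages * time_per_message),
    assuming zero serial fraction and perfectly uniform tasks and messages."""
    t_parallel = t_seq / n_procs + n_messages * t_message
    return t_seq / t_parallel

# Example: 100 seconds of sequential work, 1000 messages of 1 ms each.
for n in (2, 8, 32, 128):
    print(n, round(speedup_with_comm(100.0, n, 1000, 0.001), 2))
```

With these (arbitrary) numbers the communication term puts a ceiling of about 100 on the achievable speedup, no matter how many processors are added.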

Simple Analytic Models for Execution Time

Let T be the time to complete. Let T_j be the time taken by the jth processor to execute its task. Let T_j^comp be the time taken to execute the computation on processor j, etc. Then:

    T_j = T_j^comp + T_j^comm + T_j^idle
    T = max_j (T_j)

Assume all of the processes and communication channels behave identically. Then:

    T = (1/P) * sum over j = 1..P of (T_j^comp + T_j^comm + T_j^idle)
      = (1/P) * (T^comp + T^comm + T^idle)

where T^comp, T^comm and T^idle are the corresponding times summed over all P processors.

Another metric one often hears bandied about is parallel efficiency:

    E_relative = T_1 / (P x T_P)
    E_absolute = T_1(best sequential) / (P x T_P)

Often efficiency is decreased by poor synchronization behavior.
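A minimal sketch of both efficiency metrics; the timing numbers below are made up purely for illustration:

```python
def relative_efficiency(t1, p, tp):
    """E_relative = T_1 / (P * T_P), where T_1 is the one-processor time of the parallel program."""
    return t1 / (p * tp)

def absolute_efficiency(t_best_seq, p, tp):
    """E_absolute = T_1(best sequential) / (P * T_P)."""
    return t_best_seq / (p * tp)

# Example: the parallel code takes 120 s on 1 processor and 18 s on 8 processors;
# the best sequential algorithm takes 100 s.
print(relative_efficiency(120.0, 8, 18.0))   # ~0.83
print(absolute_efficiency(100.0, 8, 18.0))   # ~0.69
```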

Scalability - What and Why?

Scalability for fixed problem size:
- How many resources can be effectively applied to a computation of a given size?
- How does efficiency vary as P increases with N constant?

Scalability for increasing problem size:
- How big a problem can I effectively solve with a given set of resources?
- What is the dependence of E on both N and P?

A computation is scalable if E remains constant as P increases.

Parallel efficiency in the partitioning of a 2-D grid: efficiency as a function of communication cost.

1-D partitioning of a 2-D grid:
a) Partition 64 points among two processors.
b) Partition 64 points among four processors. E decreases with respect to a).
c) Partition 128 points among four processors. E is the same as for a).

Partitioning and Scalability Example

Finite difference computation on a three-dimensional mesh of dimensions N x N x Z. The computation is that each mesh point is replaced by the average of its neighboring mesh points.

    T(sequential) ~ t_c * N * N * Z

Partition the mesh into P partitions along one dimension. T(comm, 1-D part.) is the time to send one plane of the mesh in each direction:

    T(comm, 1-D part.) = 2P(t_s + t_w * 2NZ)    (totaled over the P processors)

Assume P divides N and neglect idle time. Then per processor:

    T(parallel, 1-D part.) = t_c * N * N * Z / P + 2(t_s + t_w * 2NZ)
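A small sketch of this 1-D execution-time model; the machine parameters t_c, t_s, t_w below are made-up values, not from the slides:

```python
def t_parallel_1d(N, Z, P, t_c=1.0, t_s=1.0, t_w=0.1):
    """T(parallel, 1-D part.) = t_c*N*N*Z/P + 2*(t_s + 2*N*Z*t_w)."""
    return t_c * N * N * Z / P + 2.0 * (t_s + 2.0 * N * Z * t_w)

# Execution time falls with P but levels off at the fixed communication term.
for P in (4, 16, 64, 256, 1024):
    print(P, round(t_parallel_1d(64, 64, P), 1))
```

Beyond some P the communication term dominates, so adding processors buys little further reduction in time; this is the lower bound mentioned on the next slide.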

    E_relative(parallel, 1-D part.) = t_c * N * N * Z / (t_c * N * N * Z + 2 * t_s * P + 4 * N * Z * P * t_w)        (with T_1 = t_c * N * N * Z)

- E decreases with increasing P, t_s, and t_w. The asymptotic decrease with P is as 1/P.
- E increases with increasing N, Z and t_c.
- Execution time decreases with increasing P but has a lower bound determined by T(comm).
- Execution time increases with increasing N, Z, t_c, t_s, and t_w.

We conclude that this computation, as formulated, is not scalable for constant N and Z. But let Z = N and let both N and P increase to asymptotic values. Is the computation scalable?

Scalability of the Finite Difference Computation with a Two-Dimensional Partitioning of the Mesh

- Each computation is of size (N/sqrt(P)) * (N/sqrt(P)) * Z.
- Each communication operation is of size 2 * (N/sqrt(P)) * Z.

Efficiency for the computation with the two-dimensional partitioning is:

    E_relative(parallel, 2-D part.) = t_c * N * N * Z / (t_c * N * N * Z + 2 * t_s * P + 4 * N * Z * sqrt(P) * t_w)

Let Z = N and N = sqrt(P):

    E_relative(parallel, 2-D part.) = t_c * P * sqrt(P) / (t_c * P * sqrt(P) + 2 * t_s * P + 4 * P * sqrt(P) * t_w)

Is this computation scalable?
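A minimal sketch comparing the two efficiency expressions on the scaled problem Z = N = sqrt(P); the machine parameters t_c, t_s, t_w are made-up illustrative values:

```python
from math import sqrt

def e_rel_1d(N, Z, P, t_c=1.0, t_s=1.0, t_w=0.1):
    """E_relative(1-D part.) = t_c*N*N*Z / (t_c*N*N*Z + 2*t_s*P + 4*N*Z*P*t_w)."""
    work = t_c * N * N * Z
    return work / (work + 2.0 * t_s * P + 4.0 * N * Z * P * t_w)

def e_rel_2d(N, Z, P, t_c=1.0, t_s=1.0, t_w=0.1):
    """E_relative(2-D part.) = t_c*N*N*Z / (t_c*N*N*Z + 2*t_s*P + 4*N*Z*sqrt(P)*t_w)."""
    work = t_c * N * N * Z
    return work / (work + 2.0 * t_s * P + 4.0 * N * Z * sqrt(P) * t_w)

# Scaled problem as on the slide: Z = N = sqrt(P).
for P in (16, 64, 256, 1024, 4096):
    N = sqrt(P)
    print(P, round(e_rel_1d(N, N, P), 3), round(e_rel_2d(N, N, P), 3))
```

With these (arbitrary) parameters the 1-D efficiency keeps falling as P grows, while the 2-D efficiency levels off near t_c / (t_c + 4*t_w), suggesting the 2-D formulation is scalable under this scaling.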

Isoefficiency Function

The isoefficiency function is the rate of growth of T(comp, sequential) as a function of P necessary to maintain a constant E.

For the 1-D partitioning, T(comp, sequential) ~ N * N * Z and

    E_relative(parallel, 1-D part.) ~ t_c * N * N * Z / (t_c * N * N * Z + 2 * t_s * P + 4 * N * Z * P * t_w).

For constant E, N ~ O(P), so the computational work must grow as O(P x P); that is, the isoefficiency function is ~ O(P x P).

For the 2-D partitioning the same derivation gives N ~ O(sqrt(P)), so its isoefficiency function is ~ O(P).
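As a rough numerical check of the O(P x P) claim for the 1-D partitioning (the machine parameters, target efficiency, and crude search loop below are made up for illustration):

```python
def e_rel_1d(N, Z, P, t_c=1.0, t_s=1.0, t_w=0.1):
    """E_relative(1-D part.) = t_c*N*N*Z / (t_c*N*N*Z + 2*t_s*P + 4*N*Z*P*t_w)."""
    work = t_c * N * N * Z
    return work / (work + 2.0 * t_s * P + 4.0 * N * Z * P * t_w)

def n_needed(P, Z=64, target=0.5):
    """Smallest N (found by a crude search) that keeps E at the target for a given P."""
    N = 1.0
    while e_rel_1d(N, Z, P) < target:
        N *= 1.01
    return N

# Doubling P roughly doubles the required N, so the work N*N*Z grows roughly as P*P.
for P in (4, 8, 16, 32, 64):
    N = n_needed(P)
    print(P, round(N, 1), round(N * N * 64))
```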