What are Clusters? Why Clusters? - a Short History

What are Clusters?

Our definition: a parallel machine built of commodity components and running commodity software. A cluster consists of nodes, each with one or more processors (CPUs), memory that is shared by all processors in (and only in) that node, and possibly peripherals such as disks; the nodes are connected to other nodes by a network. A node with more than one processor is called an SMP (symmetric multiprocessor) node. Clusters can be do-it-yourself or packaged.

Why Clusters? - a Short History

The same reasons supercomputers were needed:
- Need for increased computational power to solve larger, more complex problems: long run times or real-time constraints, large memory usage, high I/O usage
- Fault tolerance (web servers or scientific computation)
- Interest in a low-cost alternative to expensive parallel machines

Taxonomy of Parallel Architectures

Arranged by tightness of coupling, i.e. latency:
- Systolic: special-purpose hardware implementations of algorithms; signal processors, FPGAs
- Vector: pipelining of arithmetic operations (ALU) and memory bank accesses (Cray)
- SIMD (Associative): Single Instruction Multiple Data; the same instruction is applied to data on each of many processors (CM-1, MPP, Staran, Aspro, Wavetracer)
- Dataflow: fine-grained asynchronous flow control depending on data precedence constraints
- PIM (processor-in-memory): combines memory and ALU on one die, giving high memory bandwidth and low latency
- MIMD (Multiple Instruction Multiple Data): execute different instructions on different data
  - MPP (Massively Parallel Processors)
    - Distributed memory (Intel Paragon)
    - Shared memory without coherent caches (BBN Butterfly, T3E)
    - CC-NUMA [cache-coherent non-uniform memory architecture] (HP Exemplar, SGI Origin 2000)
  - Clusters: ensembles of commodity components connected by an interconnection network, within a single administrative domain and usually in one room
  - (Geographically) Distributed Systems: exploit available cycles (Grid, DSI, Entropia, SETI@home)

Taxonomy of Clusters
- Workstation clusters (Networks of Workstations, NOW) (Sun, SGI)
- Beowulf-class clusters: ensembles of PCs integrated with COTS (commercial off-the-shelf, e.g. Ethernet) or SAN (system area, e.g. Myrinet) networks
- Cluster farms: existing LANs and PCs which, when idle, can be used to perform work
- Superclusters: clusters of clusters, within a LAN or a campus, perhaps using the campus network for interconnection

Clusters - Deployment
- Originally used by research groups within national labs and universities for computational science, not just Math & CS research
- Now being deployed in supercomputer centers and used commercially; OSC recently replaced an IBM SP2 with a Compaq cluster
- Smaller ones can be afforded by departments and research groups

From Cray to Beowulf
- Vector computers
- Parallel computers
  - shared memory, bus based (SGI Origin 2000)
  - distributed memory, based on an interconnection network (IBM SP2)
- Networks of Workstations (Sun, HP, IBM, DEC), possibly shared use: NOW (Berkeley), COW (Wisconsin)
- PC (Beowulf) clusters, originally dedicated use: Beowulf (CESDIS, Goddard Space Flight Center, 1994); possibly SMP nodes

Evolution of Parallel Machines
- Originally, parallel machines had custom chips (CPUs), a custom bus/interconnection network, a custom I/O system, and a proprietary compiler or library
- More recently, parallel machines have a custom bus/interconnection network and possibly a custom I/O system, but standard chips and standard compilers (f90) or libraries (MPI)

Cluster Advantages
- Capability scaling
- Convergence architecture - standards based
- Price/performance
- Flexibility of configuration and upgrade
- Technology tracking
- High availability (redundancy)
- Personal empowerment
- Development cost and time for manufacturers

Application Requirements
- Computational requirements: the number of floating point operations. Compare with the attainable performance for the types of operations in the calculation, not with the manufacturer's peak Mflops rating; peak Mflops is often only the speed in cache (more later)
- Memory: cache / main memory / virtual memory (on disk); virtual memory is not used if high performance is needed
- I/O: often need large data output; NFS issues (low performance and concurrency problems)

Application Requirements - Parallelism
- Embarrassingly (or pleasingly) parallel: can be broken into multiple tasks that need little or no communication, e.g. parameter studies, web servers
- Applications requiring explicit parallelism: need to evaluate whether they can be run effectively on a cluster
- Metrics used:
  - Latency (minimum time to send a message)
  - Overhead (time the CPU spends sending a message)
  - Bandwidth
  - Contention (for a shared resource, e.g. the network or memory bus)

Simplest Model

    T = s + r n

where T is the time to transfer n bytes, s is the latency, and r = 1/B where B is the bandwidth.

Typical numbers for clusters: s from 5 to 100 microsecs, r from 0.01 to 0.1 microsecs/byte. Note that a 2 GHz processor can issue a new floating point instruction roughly every 0.0005 microsecs, so it is easy to see that significant processing is needed between communications.
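As a rough worked example, the short C sketch below evaluates T = s + r n for a few message sizes and compares each transfer time with the number of floating point operations that fit in the same interval. The latency, bandwidth, and per-flop time constants are illustrative assumptions, not measurements from any particular cluster.

    /* Sketch of the simple transfer-time model T = s + r*n.
       All constants are assumed, illustrative values. */
    #include <stdio.h>

    int main(void) {
        double s = 50e-6;          /* latency: 50 microseconds (assumed)        */
        double B = 100e6;          /* bandwidth: 100 MB/sec (assumed)           */
        double r = 1.0 / B;        /* seconds per byte                          */
        double flop_time = 0.5e-9; /* one fp instruction per cycle at 2 GHz     */

        long sizes[] = {8, 1024, 1048576};
        for (int i = 0; i < 3; i++) {
            long n = sizes[i];
            double T = s + r * (double)n;   /* time to move n bytes             */
            /* how many flops the CPU could have done while waiting */
            printf("n = %8ld bytes: T = %10.1f us  (~%.0f flops of compute)\n",
                   n, T * 1e6, T / flop_time);
        }
        return 0;
    }

Even an 8-byte message costs tens of thousands of flop-times under these assumptions, which is the point of the slide: communicate rarely and do plenty of work in between.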

Estimating Application Requirements
- 3D PDE in a cube, N points per side, so N^3 points; time marching, 6 flops per mesh point
- Each mesh point is given by 4 values: (x, y, z) and f
- Let N = 1024 and keep 2 time steps in memory
- Data size = 2 x 4 x (1024)^3 = 8 Gwords = 64 Gbytes
- Work per step = 6 x (1024)^3 = 6 Gflops
- Need for a cluster:
  - Memory size exceeds what is available on a single node
  - Memory size exceeds the 4 Gbytes addressable with 32 bits
  - Work seems reasonable for a CPU (now approx. 6 Gflops, but that is the peak rate!)

Real Performance
- Suppose the clock rate is 2 GHz and consider the code
    for (i = 0; i < n; i++) c[i] = a[i] * b[i];
- Each iteration does 2 loads and 1 store of a double; running at 2 billion iterations per second would require moving 3 x 8 x 2 x 10^9 = 48 GB/sec from memory
- That is not available in commodity nodes; more like 0.2 to 1 GB/sec
- Achievable performance is therefore only a few percent of peak (more about this later)
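To see how far a real node falls short, the following self-contained C sketch times the loop above and reports the memory bandwidth and flop rate it sustains. The array size, the single timed pass, and the use of clock_gettime are assumptions of this sketch rather than anything prescribed by the notes; compile with optimization (e.g. gcc -O2).

    /* Sketch: measure the sustained bandwidth of c[i] = a[i] * b[i].
       Each iteration moves 3 doubles (2 loads + 1 store) = 24 bytes. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1L << 24)   /* 16M doubles per array, far larger than any cache */

    int main(void) {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        if (!a || !b || !c) { perror("malloc"); return 1; }
        for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            c[i] = a[i] * b[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double gbytes = 3.0 * sizeof(double) * (double)N / 1e9;
        /* printing c[0] keeps the compiler from discarding the loop */
        printf("c[0] = %g, sustained %.2f GB/sec, %.3f Gflop/sec\n",
               c[0], gbytes / secs, (double)N / 1e9 / secs);
        free(a); free(b); free(c);
        return 0;
    }

On typical commodity nodes the reported bandwidth is expected to be a small fraction of what would be needed to keep the floating point unit busy, which is exactly the gap the slide describes.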

Domain Decomposition
- With p processors, break the cube into sub-domains (sub-cubes) of N/p^(1/3) points per side
- Values from neighboring pieces are needed: (N/p^(1/3))^2 per face, so there is now communication overhead
- The time T becomes

    T = N^3 f / p + 6 (s + r N^2 / p^(2/3))

  where f is the time per floating point operation
- Even with infinite p, T = 6s

Cluster for Computation

    T = N^3 f / p + 6 (s + r N^2 / p^(2/3))

- Minimize cost: 2 GB of memory per node implies at least 32 nodes
- Real-time constraint: given T, solve for p
- Floating point work must be large compared with communication, which implies N^3 f / p > 6s, or p < N^3 f / (6s)
- For typical s/f and N = 1024 this limits p to a few thousand nodes, but for N = 128 and fast Ethernet it means p < 10
- Moral: be careful; don't try to use too many nodes
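A small C sketch of this model can help decide how many nodes are worth using. The values chosen below for f, s, and r are illustrative assumptions (roughly a 2 Gflop/s node and a 100 MB/s, 50-microsecond interconnect), not measured numbers.

    /* Sketch: evaluate T(p) = N^3 f / p + 6 (s + r N^2 / p^(2/3))
       and the latency bound p < N^3 f / (6 s). Compile with -lm. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double N = 1024.0;      /* mesh points per side                          */
        double f = 0.5e-9;      /* seconds per flop, ~2 Gflop/s (assumed)        */
        double s = 50e-6;       /* message latency in seconds (assumed)          */
        double r = 1.0e-8;      /* seconds per transferred value (assumed)       */

        for (int p = 1; p <= 4096; p *= 4) {
            double comp = N * N * N * f / p;
            double comm = 6.0 * (s + r * N * N / pow((double)p, 2.0 / 3.0));
            printf("p = %4d:  compute %9.4f s   communicate %9.4f s\n",
                   p, comp, comm);
        }
        printf("latency bound p < N^3 f / (6 s) = %.0f nodes\n",
               N * N * N * f / (6.0 * s));
        return 0;
    }

With these assumed values the latency bound works out to roughly two thousand nodes, consistent with the "few thousand nodes" estimate above; slower networks or smaller N shrink it quickly.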

Performance Conclusions
- The cluster is useful: the problem could not be done on a single processor because of the memory size needed
- Expected performance is a small percentage of peak
- With enough nodes the sub-domains might fit in cache and give superlinear speedup (not likely)
- Latency is the key here, but bandwidth may be in other applications
- If we record each time step we need to do 64 GB of output, so we need a high-performance I/O system
- Note that the problem was 3D; it is rarely worth doing 2D problems in parallel

Choosing a Cluster
1. Understand the application's needs
2. Decide the number and type of nodes: uniprocessor or SMP, processor type (64- or 32-bit, floating point or integer performance), memory system
3. Decide on the network: low-latency or high-bandwidth needs
4. Determine physical needs: floor space, power, cooling
5. Determine the operating system

Choosing a Cluster - Cost Tradeoffs
- The newest, fastest nodes are more expensive per flop
- If cost is irrelevant: buy the fastest
- If total computing power is the goal: buy mid- to low-end nodes but replace them frequently (every 18 months)
- If a specific computational power is needed for the application (e.g. to achieve a given computation rate): analyze the trade-offs

Reading
- Parallel Supercomputing with Commodity Components, by Michael S. Warren, Donald J. Becker, M. Patrick Goda, John K. Salmon, and Thomas Sterling, PDPTA'97. (The original Beowulf paper?)
- A Case for NOW (Networks of Workstations), by Thomas E. Anderson, David E. Culler, David A. Patterson, and the NOW Team, IEEE Micro, February 1995.