Talk based on material by Google


1 Talk based on material by Google

2 Block II: Cluster/Grid/Cloud Programming & The Message Passing Interface (MPI) Clusters: History, Architectures, Programming Concepts, Scheduling, Components, Middleware, Single System Image, Resource Management, Programming Environments & Tools, Applications, Message Passing, Load-Balancing, Distributed Shared-Memory, Parallel I/O. Grids: History, Technologies, Programming Concepts, Grid Projects, Open Standards, Resource, Protocol, Network Enabled Service, API, SDK, Syntax, Hourglass Model, Grid Layers, The Globus Toolkit, Data Grid, Portals, Resource Managers, Scheduling, Security, Economy Patterns, Projects, proteomics.net

3 History Remote Procedure Calls (RPC) Message Passing Interface (MPI)

4 Rajkumar Buyya

5 Taxonomy based on how processors, memory & interconnect are laid out and how resources are managed: Massively Parallel Processors (MPP), Symmetric Multiprocessors (SMP), Cache-Coherent Non-Uniform Memory Access (CC-NUMA), Clusters, Distributed Systems, Grids/P2P

6 MPP A large parallel processing system with a shared-nothing architecture. Consists of several hundred nodes with a high-speed interconnection network/switch. Each node consists of main memory & one or more processors, and runs a separate copy of the OS. SMP 2-64 processors today. Shared-everything architecture: all processors share all the global resources available. A single copy of the OS runs on these systems.

7 CC-NUMA A scalable multiprocessor system having a cache-coherent non-uniform memory access architecture; every processor has a global view of all of the memory. Clusters A collection of workstations / PCs that are interconnected by a high-speed network; work as an integrated collection of resources; have a single system image spanning all their nodes. Distributed systems Considered conventional networks of independent computers; have multiple system images, as each node runs its own OS; the individual machines could be combinations of MPPs, SMPs, clusters, & individual computers.

8 Vector Computers (VC) - proprietary systems: provided the breakthrough needed for the emergence of computational science, but they were only a partial answer. Massively Parallel Processors (MPP) - proprietary systems: high cost and a low performance/price ratio. Symmetric Multiprocessors (SMP): suffer from limited scalability. Distributed Systems: difficult to use and hard to extract parallel performance from. Clusters - gaining popularity: High Performance Computing - Commodity Supercomputing; High Availability Computing - Mission Critical Applications.

9

10 ACRI Alliant American Supercomputer Ametek Applied Dynamics Astronautics BBN CDC Convex Cray Computer Cray Research (SGI/Tera) Culler-Harris Culler Scientific Cydrome Dana/Ardent/Stellar Elxsi ETA Systems Evans & Sutherland Computer Division Floating Point Systems Galaxy YH-1 Goodyear Aerospace MPP Gould NPL Convex C4600 Guiltech Intel Scientific Computers Intl. Parallel Machines KSR MasPar Meiko Myrias Thinking Machines Saxpy Scientific Computer Systems (SCS) Soviet Supercomputers Suprenum

11

12 Network of Workstations

13 The promise of supercomputing to the average PC User?

14 Performance of PC/workstation components has almost reached the performance of those used in supercomputers: Microprocessors (50% to 100% per year); Networks (Gigabit SANs); Operating Systems (Linux, ...); Programming environments (MPI, ...); Applications (.edu, .com, .org, .net, .shop, .bank). The rate of performance improvement of commodity systems is much more rapid than that of specialized systems.

15 Linking together two or more computers to jointly solve computational problems. Since the early 1990s, there has been an increasing trend to move away from expensive and specialized proprietary parallel supercomputers towards clusters of workstations: hard to find money to buy expensive systems; rapid improvement in the availability of commodity high-performance components for workstations and networks. Low-cost commodity supercomputing: from specialized traditional supercomputing platforms to cheaper, general-purpose systems consisting of loosely coupled components built up from single- or multiprocessor PCs or workstations.

16 PDA Clusters

17 A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone computers cooperatively working together as a single, integrated computing resource. A node: a single- or multiprocessor system with memory, I/O facilities, & OS. A cluster: generally 2 or more computers (nodes) connected together, in a single cabinet or physically separated & connected via a LAN; appears as a single system to users and applications; provides a cost-effective way to gain features and benefits.

18 [Cluster architecture diagram] Sequential applications and parallel applications run on top of a parallel programming environment and cluster middleware (single system image and availability infrastructure); each PC/workstation node runs communications software over its network interface hardware, and all nodes are joined by the cluster interconnection network/switch.

19 Commodity parts? Communications packaging? Incremental scalability? Independent failure? Intelligent network interfaces? A complete system on every node: virtual memory, scheduler, files. Nodes can be used individually or jointly...

20 Parallel Processing Use multiple processors to build MPP/DSM-like systems for parallel computing. Network RAM Use the memory associated with each workstation as an aggregate DRAM cache. Software RAID (redundant array of inexpensive disks) Use the arrays of workstation disks to provide cheap, highly available and scalable file storage; possible to provide parallel I/O support to applications. Multipath Communication Use multiple networks for parallel data transfer between nodes. MPP: Massively Parallel Processing. DSM: Distributed Shared Memory.

21 Cluster Design Issues Enhanced Performance (performance at low cost) Enhanced Availability (failure management) Single System Image (look-and-feel of one system) Size Scalability (physical & application) Fast Communication (networks & protocols) Load Balancing (CPU, Net, Memory, Disk) Security and Encryption (clusters of clusters) Distributed Environment (social issues) Manageability (admin. and control) Programmability (simple API if required) Applicability (cluster-aware and non-aware app.)

22 High Performance (dedicated). High Throughput (idle cycle harvesting). High Availability (fail-over). A Unified System: HP and HA within the same cluster.

23

24 Shared pool of computing resources: processors, memory, disks, interconnect. Guarantee at least one workstation to many individuals (when active). Deliver a large % of collective resources to a few individuals at any one time.

25

26 Best of both Worlds: (world is heading towards this configuration)

27 [Diagram: producer threads (P) feed a shared queue that consumer threads (C) drain] Work queues allow threads from one task to send processing work to another task in a decoupled fashion.
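
A minimal sketch of such a work queue for threads, using POSIX threads in C (the queue type, fixed capacity, and int work items are illustrative assumptions, not from the slides):

    #include <pthread.h>

    #define QUEUE_CAP 16

    /* Bounded shared queue; initialize lock/conds with the
       PTHREAD_*_INITIALIZER macros. */
    typedef struct {
        int items[QUEUE_CAP];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t not_empty, not_full;
    } work_queue;

    /* Producer side: blocks while the queue is full. */
    void queue_put(work_queue *q, int item) {
        pthread_mutex_lock(&q->lock);
        while (q->count == QUEUE_CAP)
            pthread_cond_wait(&q->not_full, &q->lock);
        q->items[q->tail] = item;
        q->tail = (q->tail + 1) % QUEUE_CAP;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    /* Consumer side: blocks while the queue is empty. */
    int queue_get(work_queue *q) {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        int item = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        return item;
    }

Producers and consumers are fully decoupled: neither needs to know how many of the other exist, only the queue.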

28 [Diagram: producers and consumers now run on separate machines, with the shared queue reached over the network] To make this work in a distributed setting, we would like this to simply happen over the network.

29 Where does the queue live? How do you access it? (custom protocol? a generic memory-sharing protocol?) How do you guarantee that it doesn't become a bottleneck / source of deadlock?... Some well-defined solutions exist to support inter-machine programming, which we'll see next

30

31 Regular client-server protocols involve sending data back and forth according to a shared state. Client: GET index.html HTTP/1.0, then GET hello.gif HTTP/1.0. Server: 200 OK Length: 2400 (file data), then 200 OK Length: 81494 (file data).

32 RPC servers will call arbitrary functions in a dll or exe, with arguments passed over the network, and return values sent back over the network. Client: foo.dll, bar(4, 10, "hello"); foo.dll, baz(42). Server: "returned_string"; err: no such function.

33 RPC can be used with two basic interfaces: synchronous and asynchronous. Synchronous RPC is a remote function call: the client blocks and waits for the return value. Asynchronous RPC is a remote thread spawn.

34

35 [Timeline diagram: asynchronous RPC between client and server] Client: h = Spawn(server_name, "foo.dll", long_runner, x, y); (more code... keeps running). The server's RPC dispatcher invokes, in foo.dll: GiantObject long_runner(x, y) { return new GiantObject(); }. Client, later: GiantObject myobj = Sync(h);

36

37 Writing rpc_call("foo.dll", "bar", arg0, arg1, ...) is poor form: confusing code, breaks abstraction. A wrapper stub function makes code cleaner: bar(arg0, arg1); // programmer writes this; it makes the RPC under the hood
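
A sketch of what such a stub could look like in C; rpc_call and its return convention are hypothetical stand-ins for whatever low-level RPC machinery the system provides:

    /* Hypothetical low-level RPC entry point (not a real library call). */
    char *rpc_call(const char *module, const char *func, ...);

    /* Wrapper stub: callers see an ordinary function. The stub hides
       the module name and the network call behind a clean signature. */
    char *bar(int a, int b, const char *s) {
        return rpc_call("foo.dll", "bar", a, b, s);
    }

    /* The programmer simply writes: result = bar(4, 10, "hello"); */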

38 Who can call RPC functions? Anybody? How do you handle multiple versions of a function? Need to marshal objects. How do you handle error conditions? Numerous protocols address these issues: DCOM, CORBA, JRMI.

39 Imagine a Beowulf cluster of these -- common Slashdot meme

40 Traditional cluster computing involves explicitly forming a cluster from computer nodes and dispatching jobs. Beowulf is a style of system that links Linux machines together. MPI (Message Passing Interface) describes an API for allowing programs to communicate with their parallel components.

41 Makes a cluster of computers present a single computer interface. One computer is the master: it starts tasks, and the user terminal / external network is connected to this machine. Several worker nodes form the backend; they are not usually individually accessed.

42 Runs on commodity PCs Uses standard Ethernet network (though faster networks can be used too) Open-source software

43 Beowulf is an architecture style; it is not itself an explicit library. Client nodes are set up in a very dumb fashion: they use NFS to share the file system with the master. The user starts programs on the master machine; scripts use rsh to invoke subprograms on worker nodes.

44 If you need several totally isolated jobs done in parallel, the above is all you need. Most systems require more inter-thread communication than Beowulf offers; special libraries make this easier.

45 MPI is an API that allows programs running on multiple computers to interoperate. MPI itself is a standard; implementations of it exist in C and Fortran. It provides synchronization and communication operations to processes.

46 Messages are sequences of bytes moving between processes. The sender and receiver must agree on the type structure of values in the message. Marshalling: data layout so that there is no ambiguity, such as four chars vs. one integer.
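
A minimal C sketch of marshalling a single value (the function names are illustrative): writing a 32-bit integer in network byte order removes the four-chars-vs-one-integer ambiguity between machines with different byte orders.

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    /* Pack a 32-bit integer into a message buffer in network byte
       order; returns the number of bytes written. */
    size_t pack_int32(unsigned char *buf, int32_t value) {
        uint32_t net = htonl((uint32_t)value);
        memcpy(buf, &net, sizeof net);
        return sizeof net;
    }

    /* Unpack on the receiving side, whatever that host's byte order. */
    int32_t unpack_int32(const unsigned char *buf) {
        uint32_t net;
        memcpy(&net, buf, sizeof net);
        return (int32_t)ntohl(net);
    }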

47 Process A sends a data buffer as a message to process B. Process B waits for a message from A, and when it arrives copies it into its own local memory. No memory is shared between A and B.

48 Obviously, messages cannot be received before they are sent. A receiver waits until there is a message. Asynchronous: the sender never blocks, even if infinitely many messages are waiting to be received. Semi-asynchronous: a practical version of the above with a large but finite amount of buffering.

49 Q: send(m, P) sends message m to process P. P: recv(x, Q) receives a message from process Q and places it in variable x. The data type of x must match that of m; as if x := m.
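
The same send/recv pairing written against the real MPI API in C (the ranks and the single int payload are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, m = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                  /* Q: send(m, P) */
            m = 42;
            MPI_Send(&m, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {           /* P: recv(x, Q) */
            MPI_Recv(&m, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %d\n", m);   /* as if x := m */
        }
        MPI_Finalize();
        return 0;
    }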

50 One sender Q, multiple receivers P; not all receivers may receive at the same time. Q: broadcast(m) sends message m to all processes. P: recv(x, Q) receives the message from process Q and places it in variable x.

51 Sender blocks until receiver is ready to receive. Cannot send messages to self. No buffering.

52 Sender never blocks. Receiver receives when ready. Can send messages to self. Infinite buffering.

53 Speed: not so good. The sender copies the message into system buffers; the message travels the network; the receiver copies the message from system buffers into local memory. Special virtual memory techniques help. Programming quality: less error-prone compared to shared memory.

54 User explicitly spawns child processes to do work. The MPI library is aware of the size of the universe (the number of available machines). The MPI system will spawn processes on different machines; they do not need to be the same executable.
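
In MPI-2 this explicit spawning is done with MPI_Comm_spawn; a minimal sketch (the "worker" executable name and the count of 4 are illustrative):

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Comm children;
        int errcodes[4];
        MPI_Init(&argc, &argv);
        /* Ask the runtime to start 4 copies of a separate "worker"
           executable; MPI places them on available machines. */
        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &children, errcodes);
        /* Parent and children can now exchange messages over the
           intercommunicator 'children'. */
        MPI_Finalize();
        return 0;
    }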

55 MPI programs define a Window of a certain size as a shared memory region Multiple processes attach to the window Get() and Put() primitives copy data into the shared memory asynchronously Fence() command blocks until all users of the window reach the fence, at which point their shared memories are consistent User is responsible for ensuring that stale data is not read from shared memory buffer
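
A minimal sketch of the window/Put/Fence pattern using MPI's one-sided API (the two ranks and the single-int window are illustrative):

    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, buf = 0;
        MPI_Win win;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Each process exposes one int as its piece of the window. */
        MPI_Win_create(&buf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        MPI_Win_fence(0, win);            /* open an access epoch */
        if (rank == 0) {
            int value = 42;
            /* Asynchronously copy 'value' into rank 1's window. */
            MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);            /* all windows consistent here */
        /* After the second fence, rank 1 sees buf == 42. */
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }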

56 Supports the intuitive notion of barriers with Fence(). Mutual exclusion locks are also supported; the library ensures that multiple machines cannot hold the lock at the same time. Ensuring that failed nodes cannot deadlock an entire distributed process increases system complexity.

57 The basic communication unit in MPI is a message: a piece of data sent from one machine to another. MPI provides message-sending and -receiving functions that allow processes to exchange messages in a thread-safe fashion over the network. It also includes multi-party messages...

58 1:n broadcast: one process sends a message to all processes in a group. n:1 reduce: all processes in a group send data to a designated process, which merges the data. n:n messaging is also supported.

59 One process in a group can send a message which all group members receive (e.g., a global stop processing signal)

60 Processes in a group can all report data together (asynchronously) which is gathered into a single message reported to one process (e.g., reporting results of a distributed computation)

61 Combination of above paradigms; individual processes contribute components to a global message which reaches all group members
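
The three group-communication patterns above map directly onto MPI collectives; a minimal C sketch combining all three (the payloads are illustrative, and the gather buffer assumes at most 64 processes):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int cmd = (rank == 0) ? 1 : 0;    /* e.g. a global "stop" signal */
        MPI_Bcast(&cmd, 1, MPI_INT, 0, MPI_COMM_WORLD);       /* 1:n */

        int local = rank * rank, total = 0;
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM,
                   0, MPI_COMM_WORLD);                        /* n:1 */

        int all[64];                      /* assumes size <= 64 */
        MPI_Allgather(&local, 1, MPI_INT,
                      all, 1, MPI_INT, MPI_COMM_WORLD);       /* n:n */

        if (rank == 0) printf("sum of contributions = %d\n", total);
        MPI_Finalize();
        return 0;
    }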

62 Programmers have very explicit control over data manipulation; this allows high-performance applications. The trade-off is a steep learning curve. Systems such as MapReduce have a considerably lower learning curve (but cannot handle as complex system interactions).

63 Generic RPC and shared-memory libraries allow flexible definition of software systems, but require programmers to think hard about how the network is involved in the process. Systems such as MapReduce (next lecture) automate much of the lower-level inter-machine communication, in exchange for some inflexibility of design.
