Architecture of Large Systems CS-602 Computer Science and Engineering Department National Institute of Technology


1 Architecture of Large Systems CS-602 Computer Science and Engineering Department National Institute of Technology. Instructor: Dr. Lokesh Chouhan. Slide sources: Andrew S. Tanenbaum, Structured Computer Organization; Morris Mano, Computer System Architecture; adapted and supplemented.

2 Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Questions about parallel computers: How large a collection? How powerful are the processing elements? How do they cooperate and communicate? How are data transmitted? What type of interconnection? What are the HW and SW primitives for the programmer? Does it translate into performance?

3 What Level of Parallelism? Bit-level parallelism: 1970 to ~1985: 4-bit, 8-bit, 16-bit, 32-bit microprocessors. Instruction-level parallelism (ILP): ~1985 through today: Pipelining, Superscalar, VLIW, Out-of-Order execution. Limits to the benefits of ILP? Process-level or thread-level parallelism: mainstream for general-purpose computing? Servers are parallel. High-end desktop dual-processor PC.

4 Flynn's Taxonomy 1. Single Instruction stream, Single Data stream (SISD) 2. Single Instruction stream, Multiple Data streams (SIMD) 3. Multiple Instruction streams, Single Data stream (MISD) 4. Multiple Instruction streams, Multiple Data streams (MIMD) Abbreviations: S - Single, M - Multiple, I - Instruction, D - Data
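
To make the four categories concrete, here is a minimal C sketch (our own illustration, not from the slides; the enum and helper names are invented for this example) that encodes Flynn's classes and prints one:

```c
/* A minimal, illustrative mapping of Flynn's taxonomy to a C enum.
 * The enum and helper are invented for this example. */
#include <stdio.h>

typedef enum {
    SISD,  /* Single Instruction, Single Data:   conventional uniprocessor  */
    SIMD,  /* Single Instruction, Multiple Data: e.g., ILLIAC IV, CM-2      */
    MISD,  /* Multiple Instruction, Single Data: rarely built in practice   */
    MIMD   /* Multiple Instruction, Multiple Data: clusters, multi-core PCs */
} flynn_class;

static const char *flynn_name(flynn_class c) {
    switch (c) {
    case SISD: return "SISD";
    case SIMD: return "SIMD";
    case MISD: return "MISD";
    default:   return "MIMD";
    }
}

int main(void) {
    printf("A multi-core PC is %s\n", flynn_name(MIMD));
    return 0;
}
```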

5 Single Instruction stream, Single Data stream (SISD) Sequential execution of instructions, processed by one CPU. The CPU contains one PE (processing element), i.e., one ALU under one control unit. This is the conventional serial computer.

6 Single Instruction stream, Single Data stream (SISD) A serial (non-parallel) computer. Single Instruction: only one instruction stream is being acted on by the CPU during any one clock cycle. Single Data: only one data stream is being used as input during any one clock cycle. Deterministic execution. This is the oldest and, even today, the most common type of computer. Examples: older-generation mainframes, minicomputers and workstations; most modern-day PCs.

7 Single Instruction stream, Multiple Data streams (SIMD) The same instruction is executed by different processors on different data streams. Instruction streams are broadcast from the same control unit to the different processors. Main memory is divided into multiple modules, generating a unique data stream for each processor.

8 Single Instruction stream, Multiple Data streams (SIMD) A type of parallel computer. Single Instruction: all processing units execute the same instruction at any given clock cycle. Multiple Data: each processing unit can operate on a different data element. Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing. Synchronous (lockstep) and deterministic execution. Two varieties: Processor Arrays and Vector Pipelines. Examples of some early computers of this type: Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV.
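
The defining SIMD property above, one instruction applied to many data elements in lockstep, is what vector hardware exploits. A minimal C sketch of this idea (our own illustration; it assumes a compiler with OpenMP 4.0+ support for `#pragma omp simd`, and the pragma is simply ignored elsewhere):

```c
/* Illustrative SIMD-style loop: one ADD instruction applied to N data lanes.
 * Compile with -fopenmp to let the compiler map the loop onto vector
 * instructions; without OpenMP the loop still runs, just scalar. */
#include <stdio.h>

#define N 8

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 10.0f * i; }

    #pragma omp simd                 /* same instruction, N data elements */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    for (int i = 0; i < N; i++) printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}
```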

9 Multiple Instruction streams, Single Data stream (MISD) Multiple processing elements are organized under the control of multiple control units. Each control unit handles one instruction stream, processed through its corresponding processing element, while the processing elements together handle one data stream. No commercially available systems have been built using this organization.

10 Multiple Instruction streams, Single Data stream (MISD) A type of parallel computer. Multiple Instruction: each processing unit operates on the data independently via separate instruction streams. Single Data: a single data stream is fed into multiple processing units. Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971). Some conceivable uses might be: multiple frequency filters operating on a single signal stream; multiple cryptography algorithms attempting to crack a single coded message.

11 Multiple Instruction streams, Multiple Data streams (MIMD) Each processor fetches its own instruction stream and operates on its own data stream. Memory is divided into different modules to generate a unique data stream for each processing element. Each processing element has its own control unit. Each processor can start or finish at a different time, so everything runs asynchronously (unlike SIMD).

12 Multiple Instruction streams, Multiple Data streams (MIMD) A type of parallel computer. Multiple Instruction: every processor may be executing a different instruction stream. Multiple Data: every processor may be working with a different data stream. Execution can be synchronous or asynchronous, deterministic or nondeterministic. Currently the most common type of parallel computer; most modern supercomputers fall into this category. Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs. Note: many MIMD architectures also include SIMD execution sub-components.
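
A minimal MIMD sketch in C using POSIX threads (our own illustration, not from the slides): two "processors" run different instruction streams on different data streams and may finish in either order, mirroring the asynchronous execution described above.

```c
/* Illustrative MIMD execution: two threads, two instruction streams,
 * two data streams, running asynchronously.
 * Compile with: cc mimd.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static void *sum_stream(void *arg) {        /* instruction stream 1 */
    int *data = arg, s = 0;
    for (int i = 0; i < 4; i++) s += data[i];
    printf("sum = %d\n", s);
    return NULL;
}

static void *max_stream(void *arg) {        /* instruction stream 2 */
    int *data = arg, m = data[0];
    for (int i = 1; i < 4; i++) if (data[i] > m) m = data[i];
    printf("max = %d\n", m);
    return NULL;
}

int main(void) {
    int d1[4] = {1, 2, 3, 4}, d2[4] = {7, 5, 9, 2};  /* two data streams */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_stream, d1);
    pthread_create(&t2, NULL, max_stream, d2);
    pthread_join(t1, NULL);   /* the threads may finish in either order */
    pthread_join(t2, NULL);
    return 0;
}
```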

13 Taxonomy of Parallel Processor Architectures

14 Flynn-Johnson classification


23 Shared Address/Memory Model Communicate via load and store. Oldest and most popular model. Based on timesharing: processes on multiple processors vs. sharing a single processor. Single virtual and physical address space. Multiple processes can overlap (share), but ALL threads share a process address space. Writes to the shared address space by one thread are visible to reads by other threads. Usual model: shared code, private stack, some shared heap, some private heap.

24 Shared Address Model Summary Each process can name all data it shares with other processes. Data transfer via load and store. Data size: byte, word, ... or cache blocks. Uses virtual memory to map virtual addresses to local or remote physical addresses. The memory hierarchy model applies: communication now moves data to the local processor cache (as a load moves data from memory to cache). Latency, BW (cache block?), scalability when communicating?
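
A minimal C sketch of the shared-address model (our own illustration using POSIX threads): the two threads communicate purely through loads and stores to one shared variable, with a mutex making one thread's writes safely visible to the other's reads; each thread still has its private stack.

```c
/* Illustrative shared-address communication: loads/stores on one shared
 * variable, protected by a mutex.
 * Compile with: cc shared.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                      /* shared data: one address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);            /* synchronize the shared store */
        counter++;                            /* communicate via store ...    */
        pthread_mutex_unlock(&lock);
    }
    return NULL;                              /* each thread: private stack   */
}

int main(void) {
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);       /* ... and via load: 200000     */
    return 0;
}
```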


26 Message Passing Programming Model Naming: processes can name private data directly. No shared address space. Operations: explicit communication via send and receive. Send transfers data from a private address space to another process. Receive copies data from a process into a private address space. Must be able to name processes.

27 Message Passing Programming Model Ordering: program order within a process. Send and receive can provide point-to-point synchronization between processes. Mutual exclusion is inherent. Can construct a global address space: process number + address within that process's address space, but no direct operations on these names.

28 Message Passing Model Whole computers (CPU, memory, I/O devices) communicate as explicit I/O operations. Send specifies a local buffer + the receiving process on a remote computer. Receive specifies the sending process on the remote computer + a local buffer to place the data. Usually send includes a process tag and receive has a rule on tags: match one, match any. Synchronization: when send completes, when buffer is free, when request accepted, receive waits for send. Send + receive => memory-memory copy, where each side supplies a local address, AND does pair-wise synchronization!
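
These send/receive semantics map directly onto MPI. A minimal sketch (our own illustration; the buffer contents and the tag value 7 are arbitrary): rank 0 sends from its private address space, and rank 1 names the sender and the tag it is willing to match (MPI_ANY_SOURCE/MPI_ANY_TAG would give the "match any" rule).

```c
/* Illustrative message passing in MPI: process 0 sends its private buffer
 * to process 1; the receive names the sender and a tag.
 * Run with: mpicc msg.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* process naming: 0..p-1 */

    if (rank == 0) {
        data = 42;                            /* private address space of rank 0 */
        MPI_Send(&data, 1, MPI_INT, 1, /*tag=*/7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", data); /* memory-to-memory copy done */
    }
    MPI_Finalize();
    return 0;
}
```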

29 Message Passing Model Send + receive => memory-memory copy, with synchronization through the OS even on one processor. History of message passing: network topology was important because one could only send to an immediate neighbor. Typically synchronous, blocking send & receive. Later, DMA with non-blocking sends; DMA for receive into a buffer until the processor does a receive, and then the data is transferred to local memory. Later, SW libraries to allow arbitrary communication. Example: IBM SP-2, RS6000 workstations in racks; the Network Interface Card has an Intel i960; an 8x8 crossbar switch as the communication building block; 40 MBytes/sec per link.
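
The later, non-blocking style mentioned above also has a direct MPI form. A minimal sketch (our own illustration): MPI_Isend/MPI_Irecv start the transfer, computation can overlap it, and MPI_Wait provides the "when is the buffer free / when has the data arrived" synchronization.

```c
/* Illustrative non-blocking message passing: start the transfer, overlap
 * work, then synchronize.
 * Run with: mpicc nb.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, out = 99, in = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(&out, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... computation could overlap the send here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);    /* now the send buffer is free */
    } else if (rank == 1) {
        MPI_Irecv(&in, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);    /* data has landed in `in`     */
        printf("rank 1 got %d\n", in);
    }
    MPI_Finalize();
    return 0;
}
```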

30 Communication Models Shared memory: processors communicate via a shared address space. Easy on small-scale machines. Advantages: model of choice for uniprocessors and small-scale MPs; ease of programming; lower latency; easier to use hardware-controlled caching. Message passing: processors have private memories and communicate via messages. Advantages: less hardware, easier to design; focuses attention on costly non-local operations. Can support either SW model on either HW base.

31 Advantages of the shared-memory communication model Compatibility with SMP hardware. Ease of programming when communication patterns are complex or vary dynamically during execution. Ability to develop apps using the familiar SMP model, with attention only on performance-critical accesses. Lower communication overhead and better use of BW for small items, due to implicit communication and memory mapping to implement protection in hardware rather than through the I/O system. HW-controlled caching to reduce remote communication by caching all data, both shared and private.

32 Advantages of the message-passing communication model The hardware can be simpler. Communication is explicit => simpler to understand; in shared memory it can be hard to know when you are communicating and when not, and how costly it is. Explicit communication focuses attention on the costly aspect of parallel computation, sometimes leading to improved structure in a multiprocessor program. Synchronization is naturally associated with sending messages, reducing the possibility of errors introduced by incorrect synchronization. Easier to use sender-initiated communication, which may have some advantages in performance.


35 Cluster


37 Constellation


39 Performance Measurements The speed of a supercomputer is generally denoted in FLOPS (Floating Point Operations Per Second): MegaFLOPS (MFLOPS), million (10^6) FLOPS; GigaFLOPS (GFLOPS), billion (10^9) FLOPS; TeraFLOPS (TFLOPS), trillion (10^12) FLOPS.
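
A tiny worked example of these unit conversions (our own illustration; 3.62 x 10^11 is simply the 362-GFLOPS Linpack figure quoted later in these notes, re-expressed):

```c
/* Illustrative unit conversion: the same raw rate in MFLOPS, GFLOPS
 * and TFLOPS, using the powers of ten defined above. */
#include <stdio.h>

int main(void) {
    double flops = 3.62e11;                    /* 3.62 x 10^11 FLOPS */
    printf("%.0f MFLOPS\n", flops / 1e6);      /* 362000 MFLOPS      */
    printf("%.0f GFLOPS\n", flops / 1e9);      /* 362 GFLOPS         */
    printf("%.2f TFLOPS\n", flops / 1e12);     /* 0.36 TFLOPS        */
    return 0;
}
```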

40 Other Message Passing Computers Cluster: computers connected over a high-bandwidth local area network (Ethernet or Myrinet), used as a parallel computer. Network of Workstations (NOW): a homogeneous cluster of same-type computers. Grid: computers connected over a wide area network.

41 MIMD System Architectures The architecture of a MIMD system, in contrast to its topology, refers to its connections to system memory. Systems may also be classified by their architectures. Two of these are: Uniform memory access (UMA) and Nonuniform memory access (NUMA).

42 Uniform memory access (UMA) UMA is a type of symmetric multiprocessor, or SMP, that has two or more processors performing symmetric functions. UMA gives all CPUs equal (uniform) access to all memory locations in shared memory. The processors interact with shared memory through some communications mechanism, such as a simple bus or a complex multistage interconnection network.

43 Uniform memory access (UMA) Architecture: Processors 1 through n all reach a single shared memory through one communications mechanism.

44 Nonuniform memory access (NUMA) NUMA architectures, unlike UMA architectures, do not allow uniform access to all shared memory locations. The architecture still allows all processors to access all shared memory locations, but in a nonuniform way: each processor can access its local shared memory more quickly than the memory modules not next to it.

45 Nonuniform memory access (NUMA) Architecture: Each processor i has a local memory i; all processors and memory modules are joined by the communications mechanism.

46 System Topologies A system may also be classified by its topology. A topology is the pattern of connections between processors. The cost-performance tradeoff determines which topology to use for a multiprocessor system.

47 Topology Classification A topology is characterized by its diameter, total bandwidth, and bisection bandwidth. Diameter: the maximum distance between two processors in the computer system. Total bandwidth: the capacity of a communications link multiplied by the number of such links in the system. Bisection bandwidth: the maximum data transfer that could occur at the bottleneck in the topology.

48 Topology Classification Bus-based networks. Ring: one of the simplest topologies. Array topology, e.g., a cube (in the case of a 3D array). Hypercube. In a ring of p processors the longest route between two processors is p/2; in a hypercube it is log2 p.
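
A minimal C sketch (our own illustration) computing the diameters quoted above for p processors, with the corresponding bisection widths noted in the comments (cutting a ring removes 2 links; cutting a hypercube removes p/2 links):

```c
/* Illustrative diameter computation for p processors (p a power of two).
 * Ring: floor(p/2) hops worst case; bisection cut = 2 links.
 * Hypercube: log2(p) hops worst case; bisection cut = p/2 links. */
#include <stdio.h>

static int ring_diameter(int p)      { return p / 2; }

static int hypercube_diameter(int p) {         /* log2(p) by bit counting */
    int d = 0;
    while (p > 1) { p >>= 1; d++; }
    return d;
}

int main(void) {
    for (int p = 4; p <= 64; p *= 2)
        printf("p=%2d  ring diameter=%2d  hypercube diameter=%d\n",
               p, ring_diameter(p), hypercube_diameter(p));
    return 0;
}
```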

49 System Topologies Shared Bus Topology Processors communicate with each other via a single bus that can handle only one data transmission at a time. In most shared buses, processors directly communicate with their own local memory, while the shared bus connects them to the global memory.

50 System Topologies Ring Topology Uses direct connections between processors instead of a shared bus. Allows multiple communication links to be active simultaneously, but data may have to travel through several processors to reach its destination.

51 System Topologies Tree Topology Uses direct connections between processors, each having three connections. There is only one unique path between any pair of processors.

52 System Topologies Mesh Topology In the mesh topology, every processor connects to the processors above and below it, and to those on its right and left.

53 System Topologies Mesh Topology

54 System Topologies Hypercube Topology A hypercube is a multiple mesh topology. Each processor connects to all other processors whose binary values differ by one bit. For example, processor 0 (0000) connects to 1 (0001) or 2 (0010).
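
The "differ by one bit" rule means a node's neighbors can be generated by XOR-ing its ID with each single-bit mask. A minimal C sketch (our own illustration) for the 16-node case used in the example above:

```c
/* Illustrative hypercube connectivity: in a d-dimensional hypercube the
 * neighbors of node n are exactly n XOR (1 << k) for k = 0..d-1, i.e.,
 * the node IDs that differ from n in one bit. */
#include <stdio.h>

int main(void) {
    int d = 4;                                  /* 16-node hypercube      */
    int node = 0;                               /* processor 0 (0000)     */
    printf("neighbors of %d:", node);
    for (int k = 0; k < d; k++)
        printf(" %d", node ^ (1 << k));         /* flip bit k: 1, 2, 4, 8 */
    printf("\n");
    return 0;
}
```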

55 System Topologies Completely Connected Topology Every processor has n-1 connections, one to each of the other processors. Complexity increases as the system grows, but this offers maximum communication capability.
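
A tiny worked example (our own illustration) of why complexity grows with size: with n-1 links per processor and each link shared by its two endpoints, a completely connected system needs n(n-1)/2 links in total.

```c
/* Illustrative link count for a completely connected topology:
 * n-1 links per processor, n*(n-1)/2 links in total, hence
 * quadratic growth in wiring as the system scales. */
#include <stdio.h>

int main(void) {
    for (int n = 2; n <= 32; n *= 2)
        printf("n=%2d  links/processor=%2d  total links=%d\n",
               n, n - 1, n * (n - 1) / 2);
    return 0;
}
```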


57 SINGLE CLUSTER OF ANUPAM Node 0 (master) and Nodes 1-15 (slaves), each an i860/XP at 50 MHz with 128 KB-512 KB cache and 64 MB-256 MB memory, connected over a MULTIBUS II backplane; disks attached via SCSI; Ethernet to other systems; X.25 links to terminals.

58 64-NODE ANUPAM CONFIGURATION Clusters linked along the X and Y directions over MULTIBUS II (MB II) buses with SCSI connections.

59 8-NODE ANUPAM CONFIGURATION

60 ANUPAM APPLICATIONS Finite Element Analysis; Protein Structures; Pressure Contour in LCA Duct (64-Node ANUPAM); 3-D Plasma Simulations.

61 Key Benefits Simple to use: ANUPAM provides the familiar Unix environment, with large memory and specially designed parallelizing tools. No parallel language needed. The SIM parallel simulator runs on any Unix-based system. Scalable and processor independent.

62 Bus-based architecture A dynamic interconnection network providing full connectivity, high speed, and TCP/IP support. A simple and general-purpose industry backplane bus, easily available off the shelf at low cost: MultiBus, VME Bus, Futurebus, many solutions. Disadvantages: one communication at a time; limited scalability of applications in bus-based systems; lengthy development cycle for specialized hardware. The i860 and Multibus-II were reaching end of line, so a radical change in architecture was needed.

63 Networks of Workstations (NOW) A new concept in parallel computing and parallel computers. Nodes are full-fledged workstations with CPU, memory, disks, OS, etc. Interconnection through commodity networks like Ethernet, ATM, FDDI, etc. Reduced development cycle, mostly restricted to software. Switched network topology.

64 Typical ANUPAM x86 Cluster Nodes NODE-01 through NODE-16 and a file server connected by CAT-5 cable to a Fast Ethernet switch, with an uplink to the outside network.

65 Trends in Clustering Clustering is not a new idea, but it has become affordable, and clusters can be built easily (plug & play) now. Even small colleges have them.

66 Cluster-based Systems Clustering is replacing all traditional computing platforms and can be configured depending on the method and applied areas: LB Cluster: network load distribution and load balancing. HA Cluster: increase the availability of systems. HPC Cluster (Scientific Cluster): computation-intensive work. Web farms: increase HTTP requests/sec. Rendering Cluster: increase graphics speed. HPC: High Performance Computing; HA: High Availability; LB: Load Balancing.

67 ANUPAM-Xeon Parallel Supercomputer No. of nodes: 128. Compute node: Processor: dual Intel Xeon 2.4 GHz; Memory: 2 GB per node. File server: dual Intel based with RAID 5, 360 GB. Interconnection network: 64-bit Scalable Coherent Interface (2D torus connectivity). Software: Linux, MPI, PVM, ANULIB. ANULIB tools: SIM, FFLOW, SYN, RE, S_TRACE. Benchmarked performance: 362 GFLOPS on High Performance Linpack.

68 ANUPAM Clusters ANU64: sustained speed on 84 P-III processors: 15 GFLOPS. ARUNA: introduced 2002, sustained speed on 64 P-IV CPUs: 72 GFLOPS. ASHVA: sustained speed on 128 Xeon processors: 365 GFLOPS.

69 Issues in building large clusters Scalability of the interconnection network. Scalability of software components: communication libraries, I/O subsystem, cluster management tools, applications. Installation and management procedures. Troubleshooting procedures.

70 Other issues in operating large clusters Space management: node form factor, layout of the nodes, cable routing and weight. Power management. Cooling arrangements.

71 GRID CONCEPT A user submits work through a User Access Point to a Resource Broker, which dispatches it to Grid Resources and returns the Result.

72 Are Grids a Solution? Grid computing means different things to different people. Goals of grid computing: reduce computing costs, increase computing resources, reduce job turnaround time, enable parametric analyses, reduce complexity to users, increase productivity. Technology issues: clusters, Internet infrastructure, MP solver adoption, administration of desktops, use of middleware to automate. Virtual Computing Centre: dependable, consistent, pervasive access to resources.

73 What is needed? A client (RPC-like: Matlab, Mathematica, C, Fortran, Java, Perl, a Java GUI) sends a request through a gatekeeper (scheduler, database, broker) to computational resources (clusters, workstations; MPI, PVM, Condor, ...) and receives a reply.

74 Typical current grid Virtual organisations negotiate with sites to agree on access to resources. Grid middleware runs on each shared resource to provide data services, computation services, and single sign-on over the INTERNET. Distributed services (both people and middleware) enable the grid. E-infrastructure is the key!

75 GARUDA The Department of Information Technology (DIT), Govt. of India, has funded C-DAC to deploy a computational grid named GARUDA as a Proof of Concept project. It will connect 45 institutes in 17 cities in the country at 10/100 Mbps bandwidth.

76 Other Grids in India EU-IndiaGrid (ERNET, C-DAC, BARC, TIFR, SINP, PUNE UNIV, NBCS). Coordination with Geant for education and research. DAE/DST/ERNET MoU for a Tier-II LHC Grid (10 universities). BARC MoU with INFN, Italy, to set up a Grid research hub. C-DAC's GARUDA Grid. Also Bio-Grid and Weather-Grid.


78 Cray XC30 Supercomputer Peak performance: 2.57 Petaflops. Each cabinet has 3 chassis; each chassis has 16 compute blades; each compute blade has 4 dual-socket nodes. Processor: Intel Xeon E5-2697 v2, 12 cores, 2.7 GHz.

79 Intel Supercomputer XC 30

80 Supercomputers

81 World's Fastest Computer Tianhe-2 (MilkyWay-2): TH-IVB-FE Cluster, Intel Xeon E5-2692 12C 2.200 GHz, TH Express-2, Intel Xeon Phi 31S1P.

82 Tianhe-2 (MilkyWay-2)

