Chapter 2. Parallel Architectures


2.1 Hardware Architecture: Flynn's Classification
The best-known and simplest classification scheme differentiates according to the multiplicity of instruction and data streams. The combination yields four classes:
- SISD (single instruction, single data): von Neumann computer
- SIMD (single instruction, multiple data): vector computers (Cray-1, CM2, MasPar, some GPUs)
- MISD (multiple instruction, single data): data flow machines
- MIMD (multiple instruction, multiple data): multiprocessor systems, distributed systems (Cray T3E, clusters, most current computers)
2-2

Flynn's classification scheme (figure): block diagrams of the SISD, SIMD, MISD, and MIMD organizations, showing how control units (CU), ALUs, and memory are connected by instruction and data streams.
2-3

MIMD Machines
a) Multiprocessor systems (shared memory systems, tightly coupled systems)
- All processors have access to common memory modules via a common interconnection network (figure: CPU1-CPU4 attached through the interconnection network to memory modules 1-3).
- Further distinction according to the accessibility of the memory modules:
  - uniform memory access (UMA): access time is independent of which processor addresses which memory module
  - non-uniform memory access (NUMA): memory access time depends on the addresses involved
2-4

MIMD Machines
b) Multicomputer systems (message passing systems, loosely coupled systems)
- Each processor has its own private memory with exclusive access.
- Data exchange takes place by sending messages across an interconnection network (figure: CPU1-CPU4, each with its private memory, attached to the interconnection network).
2-5

Pragmatic Classification (Parallel Computer) (figure): taxonomy tree covering tightly coupled systems (vector machines, DSM/SHMEM-NUMA), constellations, cluster computers (pile of PCs: NOW, workstation farms/cycle stealing, Beowulf, NT/Windows clusters), and metacomputers.
2-6

2.2 Interconnection Networks
General criteria:
- Extendibility: arbitrary increments
- Performance: short paths between all processors, high bandwidth, short delay
- Cost: proportional to the number of wires and the number of access points
- Reliability: existence of redundant data paths
- Functionality: buffering, routing, group communication
2-7

Topological Performance Aspects
- Node degree. Definition: number of directly connected neighbor nodes. Goal: constant and small (cost, scalability).
- Diameter. Definition: maximum path length between two arbitrary nodes. Goal: small (low maximum message delay).
- Edge connectivity. Definition: minimum number of edges that must be removed to partition the network. Goal: high (bandwidth, fault tolerance, parallelism).
- Bisection bandwidth. Definition: minimum total bandwidth over all sets of edges whose removal partitions the network into two equal halves. Goal: high (bandwidth, fault tolerance, parallelism).
2-8
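These measures can be computed directly from a topology's adjacency structure. Below is a minimal sketch (not part of the slides; the ring example and function names are illustrative choices) that determines the maximum node degree and the diameter of a small network by breadth-first search.

```python
# Sketch: node degree and diameter of a static topology, here an 8-node ring.
from collections import deque

def ring(n):
    """Adjacency lists of a ring with n nodes."""
    return {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}

def bfs_distances(adj, start):
    """Hop counts from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def degree_and_diameter(adj):
    max_degree = max(len(neighbors) for neighbors in adj.values())
    diameter = max(max(bfs_distances(adj, v).values()) for v in adj)
    return max_degree, diameter

print(degree_and_diameter(ring(8)))   # (2, 4): node degree 2, diameter n/2
```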

2.2.1 Static Networks
Connectivity spectrum of static networks:

                      Ring    Hypercube       Completely meshed
No. of links          n       (n/2) log2 n    n(n-1)/2
Accesses/processor    2       log2 n          n-1
Diameter              n/2     log2 n          1
2-9

Grid Structures
Examples: 4x4 grid, 4x4 torus (figures).
Properties:
- Constant node degree
- Extendibility in small increments
- Good support for algorithms with a local communication structure (modeling of physical processes, e.g. Laplace heat distribution)
2-10

Hypercube
Example: hypercube of dimension 4 (figure).
Properties:
- Logarithmic diameter
- Extendibility in powers of 2
- Variable node degree
Example machine: Connection Machine CM-2
2-11
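The logarithmic diameter can be made tangible with a few lines of code: in a d-dimensional hypercube the neighbors of a node differ in exactly one address bit, and routing fixes one differing bit per hop. This is a sketch for illustration only; the dimension-order routing shown here is the standard e-cube scheme and is not taken from the slides.

```python
# Sketch: neighbors and dimension-order routing in a d-dimensional hypercube H(d).

def neighbors(node, d):
    """All nodes one hop away: flip each of the d address bits once."""
    return [node ^ (1 << k) for k in range(d)]

def route(src, dst, d):
    """Path from src to dst, correcting differing address bits from bit 0 upward."""
    path, current = [src], src
    for k in range(d):
        if (current ^ dst) & (1 << k):   # bit k still differs
            current ^= 1 << k            # traverse the link in dimension k
            path.append(current)
    return path

print(neighbors(0b0000, 4))        # [1, 2, 4, 8]
print(route(0b0000, 0b1011, 4))    # [0, 1, 3, 11]; path length = Hamming distance <= d
```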

Trees
Examples: binary tree, X-tree (figures).
Properties:
- Logarithmic diameter
- Extendibility in powers of 2
- Node degree at most 3 (binary tree) or at most 5 (X-tree)
- Poor bisection bandwidth
- No parallel data paths (binary tree)
2-12

Fat Tree
An n-ary tree with higher link capacity (by a factor of n per level) closer to the root, which removes the bottleneck at the root (figure: link capacities 1x, 4x, 16x).
Example machine: Connection Machine CM-5
2-13

Cube Connected Cycles CCC(d)
A d-dimensional hypercube in which each node is replaced by a ring of d nodes. Each of these d nodes has two links within the ring and one further link along one of the d hypercube dimensions.
Example: cube connected cycles of dimension 3, CCC(3) (figure).
Properties:
- Logarithmic diameter
- Constant node degree (3)
- Extendibility only in powers of 2
2-14

Butterfly Graph
Properties:
- Logarithmic diameter
- Extendibility in powers of 2
- Constant node degree 4
2-15

Overview: Properties

Graph                          Number of nodes         Number of edges                              Max. node degree   Diameter
Grid G(a_1, ..., a_d)          prod_{k=1..d} a_k       sum_{i=1..d} (a_i - 1) * prod_{k != i} a_k   2d                 sum_{k=1..d} (a_k - 1)
Torus T(a_1, ..., a_d)         prod_{k=1..d} a_k       d * prod_{k=1..d} a_k                        2d                 sum_{k=1..d} floor(a_k / 2)
Hypercube H(d)                 2^d                     d * 2^(d-1)                                  d                  d
Cube Connected Cycles CCC(d)   d * 2^d                 3d * 2^(d-1)                                 3                  2d + floor(d/2)
2-16
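To make the table concrete, the following sketch (an added illustration, not part of the slides) evaluates the closed-form expressions above for a 4x4 grid, a 4x4 torus, H(4), and CCC(3); the formulas, including the CCC diameter expression, are taken as given in the table.

```python
# Sketch: evaluating the topology-properties table for concrete parameters.
# Each function returns (number of nodes, number of edges, max. node degree, diameter).
from math import prod   # requires Python 3.8+

def grid(a):
    d = len(a)
    edges = sum((a[i] - 1) * prod(a[k] for k in range(d) if k != i) for i in range(d))
    return prod(a), edges, 2 * d, sum(ak - 1 for ak in a)

def torus(a):
    d = len(a)
    return prod(a), d * prod(a), 2 * d, sum(ak // 2 for ak in a)

def hypercube(d):
    return 2**d, d * 2**(d - 1), d, d

def ccc(d):
    return d * 2**d, 3 * d * 2**(d - 1), 3, 2 * d + d // 2

print(grid((4, 4)))    # (16, 24, 4, 6)
print(torus((4, 4)))   # (16, 32, 4, 4)
print(hypercube(4))    # (16, 32, 4, 4)
print(ccc(3))          # (24, 36, 3, 7)
```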

2.2.2 Dynamic Interconnection Networks
All components have access to a joint network; connections are switched on request (figure: components attached to a switchable network).
There are basically three different classes:
- Bus
- Crossbar switch
- Multistage networks
2-17

Bus-like Networks
- Cost effective
- Blocking
- Extendible
- Suitable for a small number of components
Figures: single bus connecting memory modules (M) and processors (P); multibus (B_1 ... B_m) for multiprocessor architectures.
2-18

Crossbar Switch
- Expensive
- Non-blocking, high performance
- Fixed number of access points
- Realizable only for small networks due to quadratically growing cost
Figures: crossbar as interprocessor connection; crossbar connecting processors and memory modules.
2-19

2.2.3 Multistage Networks
Small crossbar switches (e.g. 2x2) serve as cells that are connected in stages to build larger networks.
Properties:
- Partially blocking
- Extendible
Elementary switching states of a 2x2 cell (figure): through (a), cross (b), upper broadcast (c), lower broadcast (d).
2-20

Example: Omega Network
Figure: 8x8 Omega network connecting inputs 000-111 to outputs 000-111 through stages of 2x2 switches.
The target address determines the selection of the output at each stage (0: up, 1: down).
2-21
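The destination-tag rule can be traced in a few lines: at stage k, bit k of the target address (most significant bit first) selects the upper (0) or lower (1) switch output. The sketch below is an illustration only; the perfect-shuffle wiring between stages is the standard Omega construction and is assumed here.

```python
# Sketch: destination-tag routing through an n x n Omega network (n a power of two).

def omega_route(src, dst, n):
    """Line numbers a message visits on its way from input src to output dst."""
    stages = n.bit_length() - 1              # log2(n) switching stages of 2x2 cells
    path, current = [src], src
    for k in range(stages):
        # Perfect-shuffle wiring between stages: rotate the address left by one bit.
        shuffled = ((current << 1) | (current >> (stages - 1))) & (n - 1)
        # The k-th destination bit (MSB first) selects the switch output: 0 up, 1 down.
        out_bit = (dst >> (stages - 1 - k)) & 1
        current = (shuffled & ~1) | out_bit
        path.append(current)
    return path

# Route from input 010 to output 110 in an 8x8 Omega network.
print([format(x, "03b") for x in omega_route(0b010, 0b110, 8)])   # ['010', '101', '011', '110']
```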

Example: Banyan Network
Figure: Banyan network built from 2x2 switching cells with binary-labeled inputs and outputs.
2-22

Dynamic Networks: Properties
Asymptotic growth of cost and performance features of the three classes of dynamic interconnection networks:

                              Bus     Multistage network   Crossbar switch
Latency (distance)            1       log n                1
Bandwidth per access point    1/n     < 1                  1
Switching cost                n       n log n              n^2
Wiring cost                   1       n log n              n
2-23

2.2.4 Local Networks
- Mostly bus-like or ring-like topologies
- Diverse media (e.g. coaxial cable, twisted pair, optical fiber, infrared, radio)
- Diverse protocols (e.g. IEEE 802.3 (Ethernet, Fast Ethernet, Gigabit Ethernet), IEEE 802.5 (Token Ring), FDDI, ATM, IEEE 802.11 (WLAN/WiFi), IEEE 802.15 (Bluetooth), IEEE 802.16 (WiMAX))
- Compute nodes typically heterogeneous regarding performance, manufacturer, and operating system
- Mostly hierarchical, heterogeneous structure of subnetworks (structured networks)
2-24

Typical Intranet (figure): desktop computers, file server, print and other servers, web server, and email servers on local area networks, connected to the rest of the Internet through a router/firewall.
2-25

2.3 Cluster Architectures
A cluster is a collection of usually complete, autonomous computers (workstations, PCs) combined to build a parallel computer (some cluster nodes may share infrastructure such as power supply, case, or fans). The component nodes can, but need not, be spatially tightly coupled. The coupling is realized by a high-speed network.
Networks typically used:
- Ethernet (Fast, Gigabit, 10 Gigabit, ...)
- Myrinet
- SCI (Scalable Coherent Interface)
- Infiniband
(Image source: Supermicro)
2-26

Linux Clusters (History)
- 1995: Beowulf cluster (NASA); Pentium Pro/II nodes, Fast Ethernet, MPI
- 1997: Avalon (Los Alamos); 70 Alpha processors
- 1998: ad-hoc cluster of 560 nodes for one night (Paderborn)
- 1999: Siemens hpcLine; 192 Pentium II (Paderborn)
- 2000: Cplant cluster (Sandia National Laboratories); 1000 Alpha processors
- 2001: Locus Supercluster; 1416 processors
- 2002: Linux NetworX; 2304 processors (Xeon)
2-27

Avalon Cluster (Los Alamos National Laboratory): 140 Alpha processors (photo). 2-28

World Record 1998: 560 Nodes Running Linux
WDR computer night 1998, HNF / University of Paderborn: a spontaneously connected, heterogeneous cluster of private PCs brought by visitors. Network: Gigabit/Fast Ethernet.
2-29

Supercomputers (Clusters): ASCI White, ASCI Q, ASCI Blue Pacific. 2-30

Networks for Clusters
Fast Ethernet
- Bandwidth: 100 Mbit/s
- Latency: ca. 80 µs (user level to user level)
- Was highly prevalent (low cost)
Gigabit Ethernet
- Bandwidth: 1 Gbit/s
- Latency: ca. 80 µs (user level to user level; raw: 1-12 µs)
- Today widespread (159 systems in the Top500, Nov. 2012)
10 Gigabit Ethernet
- Bandwidth: 10 Gbit/s
- Latency: ca. 80 µs (user level to user level; raw: 2-4 µs)
- Already in use (30 systems in the Top500, Nov. 2012)
2-31
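These bandwidth and latency figures can be combined in the usual linear cost model t(m) = latency + m / bandwidth for an m-byte message. The sketch below is an added illustration; the model and the chosen message sizes are assumptions, only the Fast Ethernet numbers come from the slide. It shows that small messages are latency-dominated and large messages bandwidth-dominated.

```python
# Sketch: linear communication cost model t(m) = latency + m / bandwidth.

def transfer_time(message_bytes, latency_s, bandwidth_bit_per_s):
    """Estimated user-level transfer time in seconds for a single message."""
    return latency_s + (8 * message_bytes) / bandwidth_bit_per_s

latency = 80e-6        # ca. 80 microseconds, user level to user level (Fast Ethernet)
bandwidth = 100e6      # Fast Ethernet: 100 Mbit/s
for size in (1, 1_000, 1_000_000):                     # message size in bytes
    t = transfer_time(size, latency, bandwidth)
    print(f"{size:>9} B: {t * 1e6:10.1f} us")
# 1 B ~ 80 us (latency-bound), 1 MB ~ 80,080 us (bandwidth-bound)
```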

Myrinet
- Bandwidth: 4 Gbit/s (real: 495 MB/s)
- Latency: ca. 5 µs
- Point-to-point message passing
- Switch-based
- Still used (3 systems in the Top500, Nov. 2012, using 10G Myrinet)
2-32

Scalable Coherent Interface
- SCI = Scalable Coherent Interface, ANSI/IEEE 1596-1992
- 64-bit global address space (16-bit node ID, 48-bit local memory address)
- Guaranteed data transport by a packet-based handshake protocol
- Interfaces with two unidirectional links that can work simultaneously
- Parallel copper or serial optical fiber media
- Low latency (< 2 µs)
- Supports shared memory and message passing
- Defines a cache coherence protocol (not available in every implementation)
2-33
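The 64-bit global address split (16-bit node ID plus 48-bit local memory address) can be illustrated with simple bit packing. This is a sketch only; placing the node ID in the upper 16 bits is the natural reading of the slide, but the exact field layout is an assumption.

```python
# Sketch: packing/unpacking an SCI-style 64-bit global address
# (assumed layout: 16-bit node ID in the upper bits, 48-bit local address below).

NODE_BITS = 16
LOCAL_BITS = 48
LOCAL_MASK = (1 << LOCAL_BITS) - 1

def pack(node_id, local_addr):
    assert 0 <= node_id < (1 << NODE_BITS) and 0 <= local_addr <= LOCAL_MASK
    return (node_id << LOCAL_BITS) | local_addr

def unpack(global_addr):
    return global_addr >> LOCAL_BITS, global_addr & LOCAL_MASK

g = pack(node_id=42, local_addr=0x0000_1234_5678)
print(hex(g))        # 0x2a000012345678
print(unpack(g))     # (42, 305419896)
```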

Scalable Coherent Interface (SCI)
- Standard IEEE 1596
- Bandwidth: 667 MB/s (real: 386 MB/s)
- Latency: 1.4 µs
- Shared memory in hardware
- Usually a 2D/3D torus
- Scarcely used any more (last appearance in the Top500: Nov. 2005)
(Data from Dolphin Inc.)
2-34

SCI Nodes (figure): node interface with request/response output and input FIFOs, a bypass FIFO, multiplexers, and address decoders between SCI in and SCI out; a transaction consists of (1) request, (2) request echo, (3) response, and (4) response echo.
2-35

SCI NUMA (figure): two nodes, each with processors, L1/L2 caches, a processor bus, memory, an SCI cache, and an SCI bridge; the SCI bridges are connected by an SCI ring.
2-36

SCI for PC Clusters (Dolphin adapter) (figure): as before, but the SCI bridge is attached via a bus bridge to the I/O bus rather than directly to the processor bus; the SCI bridges of the nodes form an SCI ring.
2-37

Example: SCI Cluster (University of Paderborn)
96 dual-processor PCs (Pentium III, 850 MHz) with 48 GB total memory, connected by SCI rings (16x6 torus).
NUMA architecture: hardware access to remote memory.
Siemens hpcLine (photo).
2-38

SCI Cluster (KBS, TU Berlin)
Linux cluster of 16 dual-processor nodes with 1.7 GHz Pentium 4 and 1 GB memory each, connected in a 2D torus with SCI (Scalable Coherent Interface).
2-39

SCI Address Translation (figure): process 1 on node A and process 2 on node B each use a 32-bit virtual address space; the MMU translates it to the 32-bit physical address space (real memory plus an I/O window), and the SCI address translation maps the I/O window into the 48-bit global SCI address space.
2-40

Infiniband
- A newer standard for coupling devices and compute nodes (intended as a replacement for system buses such as PCI)
- Single link: 2.5 Gbit/s
- Up to 12 links in parallel
- Most used today (224 systems in the Top500, Nov. 2012, in different configurations)
(Source: top500.org)
2-41

Infiniband
- Current technology for building HPC clusters
- 2.5 Gbit/s per link
- ca. 2 µs latency
2-42

Virtual Interface Architecture
To standardize the diversity of high-performance network protocols, an industry consortium consisting of Microsoft, Compaq (now HP), and Intel defined the Virtual Interface Architecture (VIA). Its central element is the concept of a user-level interface that allows direct access to the network interface, bypassing the operating system.
Examples: Active Messages (UCB), Fast Messages (UIUC), U-Net (Cornell), BIP (Univ. Lyon, France)
(Figure: user process accessing the network interface directly, bypassing the operating system; source: M. Baker)
2-43

Virtual Interface Architecture (figure): the application uses a communication interface (sockets, MPI, ...) and the VI user agent in user mode; the VI kernel agent handles open/connect/register-memory operations in kernel mode, while send/receive operations pass through send, receive, and completion queues directly to the VI network adapter.
2-44


2.4 Architectural Trends in HPC
- Cost savings through the use of mass-market products
- Use of top-of-the-line standard processors
- Use of top-of-the-line standard interconnects
- SMPs / multicore / manycore processors as nodes
- Such nodes are assembled to build arbitrarily large clusters (figure: SMP nodes, each with several processors and memory, attached to an interconnection network)
2-46

Top 500: Architecture 2-47

Notion of "Constellation"
Nodes of a parallel computer may consist of many processors (SMP) or multicore processors. Thus, there are two layers of parallelism: intranode parallelism and internode parallelism.
The TOP500 terminology distinguishes between cluster and constellation, depending on which form of parallelism is dominant:
- If a machine has more processor cores per node than nodes, it is called a constellation.
- If a machine has more nodes than processor cores per node, it is called a cluster.
2-48
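The TOP500 rule above amounts to a one-line predicate; the following sketch (an added illustration, not from the slides) classifies a machine from its node count and cores per node.

```python
# Sketch: TOP500-style distinction between "cluster" and "constellation".

def classify(nodes, cores_per_node):
    """Dominant form of parallelism according to the TOP500 rule stated above."""
    if cores_per_node > nodes:
        return "constellation"   # intranode parallelism dominates
    return "cluster"             # internode parallelism dominates (ties treated as cluster here)

print(classify(nodes=1024, cores_per_node=16))   # cluster
print(classify(nodes=8, cores_per_node=64))      # constellation
```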

Top 500: Processor Types 2-49

Top 500: Coprocessors 2-50

Top 500: Processor Family 2-51

Top 500: Operating system families 2-52

HPC Operating Systems 2-53

Top 500: Interconnect 2-54

Top 500: Geographical Regions 2-55

Moore's Law
The number of transistors per chip doubles (roughly) every 18 months. (Source: Wikipedia)
2-56

Transistors per Processor

Processor        Year of introduction   Transistor count
Intel 4004       1971                   2,300
Intel 8008       1972                   3,500
MOS 6502         1975                   3,510
Zilog Z80        1976                   8,500
Intel 8088       1979                   29,000
Intel 80286      1982                   134,000
Motorola 68020   1984                   200,000
Intel 80386      1985                   275,000
Motorola 68030   1987                   273,000
Intel 80486      1989                   1,180,000
Motorola 68040   1990                   ~1,200,000
2-57

Transistors per Processor (continued)

Processor                 Year of introduction   Transistor count
Intel Pentium             1993                   3,100,000
Intel Pentium Pro         1995                   5,500,000
AMD K6                    1997                   8,800,000
AMD K7                    1999                   22,000,000
Intel Pentium 4           2000                   42,000,000
AMD K8                    2003                   105,900,000
AMD K8 Dual Core          2005                   243,000,000
IBM Power 6               2007                   789,000,000
AMD Opteron (6 core)      2009                   904,000,000
Intel Xeon EX (10 core)   2011                   2,600,000,000
Intel Itanium (8 core)    2012                   3,100,000,000
2-58
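A quick check of Moore's law against the first and last table entries: the sketch below (an added illustration, not part of the slides) computes the average doubling time implied by the growth from the Intel 4004 (1971, 2,300 transistors) to the 8-core Itanium (2012, 3,100,000,000 transistors).

```python
# Sketch: average transistor-count doubling time implied by the tables above.
from math import log2

def doubling_time_months(count_start, count_end, years):
    """Average doubling period in months, assuming exponential growth."""
    return 12 * years / log2(count_end / count_start)

months = doubling_time_months(2_300, 3_100_000_000, 2012 - 1971)
print(f"average doubling time: {months:.1f} months")   # roughly 24 months
```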

Xeon EX (Westmere EX) with 10 Cores Source: computerbase.de 2-59

Intel Single Chip Cloud Computer: 48 Cores Source: Intel 2-60

Performance of Top 500 2-61

Top 500: Forecast 2-62

2.5 Software Architecture of Distributed Systems
Applications should be supported in a transparent way, i.e., we need a software infrastructure that hides the distribution to a high degree (figure: a parallel distributed application running on several computers connected by an interconnection network, with the supporting layer in between still to be determined).
2-63

Two Architectural Approaches for Distributed Systems
Distributed operating system: the operating system itself provides the necessary functionality (figure: a parallel application running on a distributed operating system that spans all computers of the interconnection network).
Network operating system: the local operating systems are complemented by an additional layer (network OS, middleware) that provides the distributed functions (figure: a parallel application running on a network operating system layered above the local OS of each computer).
2-64

General OS Architecture (Microkernel) (figure): applications and OS services (operation and management of real and logical resources, OS user interface) run in the process area above the kernel interface; the OS kernel provides process management and interaction in the kernel area, directly above the hardware.
2-65

Architecture of a Local OS (figure): user processes and OS processes run above the kernel interface; below it are the OS kernel and the hardware.
2-66

Architecture of a Distributed OS (figure): user processes and OS processes are spread across node boundaries; each node runs an OS kernel on its own hardware, and the kernels communicate over the interconnect.
2-67

Process Communication (figure): processes A and B communicate through a communication object C managed by the kernel, using Send(C, ...) and Receive(C, ...) operations.
2-68

Distributed OS: Architectural Variants
To overcome node boundaries for interaction in distributed systems, we need communication operations that are either integrated into the kernel (kernel federation) or provided by a component outside the kernel (process federation).
With kernel federation, the "distributedness" is hidden behind the kernel interface. Kernel calls can refer to any object in the system regardless of its physical location. The federation of all kernels is called the federated kernel.
With process federation, the kernel interface is left untouched. The local kernel is not aware of being part of a distributed system. In this way, existing operating systems can be extended to become distributed operating systems.
2-69

Kernel Federation (figure): on each node a process runs on top of the local OS kernel; the communication software is part of the kernels, which together form the federated kernel across the node boundary.
2-70

Process Federation (figure): on each node the communication software sits between the process and the local OS kernel; it crosses the node boundary outside the kernel, at process level.
2-71

Features of Distributed OS

                                Multiprocessor OS   Distributed OS     Network OS
All nodes have the same OS?     yes                 yes                no
How many copies of the OS?      1                   n                  n
Shared ready queue?             yes                 no                 no
Communication                   shared memory       message exchange   message exchange / shared files
2-72

Transparency I
Transparency is the property that the user (almost) does not notice the distribution of the system. Transparency is the main challenge when building distributed systems.
- Access transparency: access to remote objects in the same way as to local ones
- Location transparency:
  - Name transparency: objects are addressed by names that are independent of their locations
  - User mobility: users who change their location can address objects by the same names as before
- Replication transparency: if (temporary) copies of data are created to improve performance, their consistency is maintained automatically
2-73

Transparency II
- Failure transparency: failures of components should not be noticeable
- Migration transparency: migration of objects should not be noticeable
- Concurrency transparency: the fact that many users access objects concurrently from different locations must not lead to problems or faults
- Scalability transparency: adding more nodes to the system during operation should be possible; algorithms should cope with the growth of the system
- Performance transparency: no performance penalties due to uneven load distribution
2-74

References
- Erich Strohmaier, Jack J. Dongarra, Hans W. Meuer, and Horst D. Simon: Recent Trends in the Marketplace of High Performance Computing. http://www.netlib.org/utk/people/jackdongarra/papers/marketplace-top500-pc.pdf
- Gregory Pfister: In Search of Clusters (2nd ed.). Prentice-Hall, 1998
- Thorsten von Eicken, Werner Vogels: Evolution of the Virtual Interface Architecture. IEEE Computer, Nov. 1998
- Hans Werner Meuer: The TOP500 Project: Looking Back over 15 Years of Supercomputing Experience. http://www.top500.org/files/top500_looking_back_hwm.pdf
- David Culler, J. P. Singh, Anoop Gupta: Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1999
- Hesham El-Rewini and Mostafa Abd-El-Barr: Advanced Computer Architecture and Parallel Processing (Advanced Computer Architecture, Vol. 2). John Wiley & Sons, 2005
- V. Rajaraman and C. Siva Ram Murthy: Parallel Computers: Architecture and Programming. Prentice-Hall, 2004
2-75