PEPE: A Trace-Driven Simulator to Evaluate Reconfigurable Multicomputer Architectures*

Jose M. Garcia (1), Jose L. Sanchez (2), Pascual Gonzalez (2)

(1) Universidad de Murcia, Facultad de Informatica, Campus de Espinardo, 30071 Murcia, Spain. jmgarcia@dif.um.es
(2) Universidad de Castilla-La Mancha, Escuela Politecnica, Campus Universitario, 02071 Albacete, Spain. {jsanchez,pgonzalez}@info-ab.uclm.es

* This work was supported in part by CICYT under Grant TIC94-0510-C02-02.

Abstract. Recent research on parallel systems with distributed memory has shown that the most difficult problem for system designers and users is related to the interconnection network. In this paper, we describe a programming and evaluation tool for multicomputers, named PEPE. It allows the execution of parallel programs and the evaluation of the network architecture. PEPE takes a parallel program as input and generates a communication trace obtained from this program. Next, PEPE simulates and evaluates the behaviour of a multicomputer architecture for this trace. The most important parameters of the multicomputer can be adjusted by the user. PEPE generates performance estimates and quality measures for the interconnection network. Another important feature of this tool is that it allows us to evaluate networks whose topology is reconfigurable. Reconfigurable networks are good alternatives to the classical approach; however, only recently has this idea become the focus of much interest, due to technological developments (optical interconnection) that have made it more viable. A reconfigurable network yields a variety of possible topologies and enables the program to exploit this topological variety to speed up the computation.

1 Introduction

The growing demand for high processing power in various scientific and engineering applications has made multiprocessor architectures increasingly popular. This is exemplified by the proliferation of a variety of parallel machines with diverse design philosophies. This diversity in architectural design has created a need for developing performance models and simulators for multiprocessors, not only to analyze the effectiveness of a design, but also to reduce the design time.

Distributed memory multiprocessors (often called multicomputers) are increasingly being used to provide high levels of performance for scientific applications. Distributed memory machines offer significant advantages over their shared memory counterparts in terms of cost and scalability, but it is a widely accepted fact that they are much more difficult to program than shared memory machines. As a result, the programmer has to distribute code and data over the processors himself, and manage communication among processes (or tasks) explicitly.

Simulators provide many advantages over running directly on a multiprocessor, including the cost effectiveness of workstations, the ability to exploit powerful sequential debuggers, the support for non-intrusive data collection and invariant checking, and the versatility of simulation. Several simulators, such as Pie [12] or Paret [11], have been developed. Usually these projects are aimed at developing programming environments and tools for the programmer, rather than tools designed to evaluate the architecture of the system. Newer high-performance simulators such as TangoLite [4] or Proteus [2] consider the architectural support of the system, showing several results about the performance of parallel machines.

A very interesting approach is to evaluate the behaviour of an architecture using a trace taken from suitable algorithms. Trace-driven simulations, which evaluate network performance on actual communication streams taken from characteristic programs, are the most reliable way to evaluate a network design. These simulations require great computational power, because of the many different design possibilities that must be simulated, and because of the length of the communication traces that drive the simulation.

We have developed a simulator to evaluate the main features of the interconnection network, because in this way it can more faithfully represent the hardware implementation, taking into account details like channel multiplexing, partial buffering and delays in blocked messages. Furthermore, we are very interested in evaluating some new features in parallel machines, mainly reconfigurable architectures, that is, multicomputers whose networks can change their topology dynamically.

In this paper we describe our environment, called PEPE (the acronym stands for Programming Environment for Parallel Execution). PEPE provides a user-friendly visual interface for all phases of parallel program development, i.e. parallel algorithm design, coding, debugging, task mapping, execution control and evaluation of some architecture parameters. Our environment differs from previous work in two major features. First, PEPE allows us to evaluate the network for a communication trace taken from proper scientific problems we have previously coded in the environment; that is, we can evaluate and modify the behaviour of the interconnection network for real problems and not only for predetermined workloads. Second, it allows us to evaluate the performance of a multicomputer with a reconfigurable interconnection network.

A completely connected interconnection network can match the communication requirements of any application, but it is too expensive to build even for a moderate number of nodes. Reconfigurable interconnection networks are alternatives to complete connections. This paradigm is suitable with either electronic or optical interconnections, which are applicable to a large class of networks. Some researchers believe that the immediate goal in the development of computer networks should be hybrid optical-electronic systems, which combine the advantages of both electronic and optical technologies while avoiding their disadvantages. Reconfigurable networks are especially suitable when these technologies are combined.

Up to now, we know of only one other related environment for parallel programming on reconfigurable multicomputers; it is described in [1]. That environment is devoted to phase-reconfigurable programs, that is, programs must be implemented as a series of phases, and each phase is assumed to be separated from the next by synchronization-reconfiguration points. These points select the appropriate topology for each phase. Additionally, that tool is language-oriented. Our environment focuses on testing the different parameters of dynamically reconfigurable networks. A dynamically reconfigurable network is one that can vary its topology arbitrarily at runtime. In this approach, any arbitrary topology is allowed, so that the interconnection network can easily match the communication requirements of a given algorithm. In our environment, we can study in depth the main concepts and options of dynamically reconfigurable networks, their limitations and tradeoffs.

The rest of the paper is arranged as follows. In section 2 the overall environment is described and its different parts are shown. The programming style and several issues related to it are detailed in section 3. The structure of the traces is discussed in section 4. In section 5, we present the network simulator and explain the main results we can obtain with it. Finally, we outline some conclusions.

2 An Overview of PEPE

In this section we present PEPE. The key features of this environment are simplicity and completeness. Our main goal has been to obtain a flexible system which gives us an easy and efficient way to evaluate the reconfigurable multicomputer architecture. Our environment has been developed on a workstation using the C language.

PEPE provides a user-friendly visual interface for all phases of parallel program development and tuning. This graphical interface (figure 1) has been designed with the aim of keeping it really comfortable for the user, following the styles adopted nowadays by most human-oriented interfaces [9]. In our environment, the user gets tools for easy experimentation with both different parallelization possibilities and different network parameters. With this methodology, the programmer can analyze several parallelization strategies very quickly and evaluate these strategies with tools for performance analysis.

PEPE simulates the execution of a parallel algorithm at two levels. At the first level, we are interested in verifying the behaviour of the parallel algorithm and studying the different parallelization strategies, as well as the problems that arise when the parallel algorithms are coded and executed, such as deadlock, livelock, etc. For this, the execution of the parallel algorithm is simulated over a virtual architecture. At the second level, the behaviour of the reconfigurable network is studied. For this, we start with the communication pattern produced by the parallel algorithm, and the performance of the network for this communication pattern is simulated. At this level, PEPE allows us to use different parameters for the reconfigurable network.

Fig. 1. PEPE's graphical interface.

These levels give rise to the two phases of the simulator. PEPE has two main phases and several modules within them. The first phase is more language-oriented, and it allows us to code, simulate and optimize a parallel program. In this phase, interactive tools for specification, coding, compiling, debugging and testing were developed. This phase is architecture independent. The second phase has several tools for mapping and evaluating the reconfigurable architecture. We can vary several network parameters, such as the interconnection topology and the routing algorithm. In figure 2 we show the modules of PEPE.

The link between the phases is an intermediate code that is generated as an optional result of the first phase. This intermediate code is a trace of the communication pattern of the source program. This allows the user to use the environment as a whole or each phase singly. For example, we can execute only the first phase to test the parallel behaviour of an algorithm on an ideal multicomputer. We can also obtain an intermediate code from a key parallel algorithm and then execute the second phase several times from this trace with different network parameters, to evaluate and tune the network for this key algorithm.

Fig. 2. Modules of PEPE. Phase 1 (front-end): editing, compiling, debugging, execution and optimization of the parallel source program, producing the parallel algorithm results and, optionally, the intermediate code. Phase 2 (back-end): process mapping, network definition and the network simulator, producing the network performance results.

3 The Programming Tool

In this section we outline some important features of the first phase of PEPE. In this module we try to overcome the difficulty of programming multicomputers with an integrated approach and virtual concepts.

The programming tool aims at increasing programming productivity. It takes a user parallel program as its input and tries to simulate its parallel execution on a virtual multicomputer. In this way, the user can test his parallel algorithm and, in case (s)he is not satisfied, come back to code a new parallel version. In this phase it is important to abstract from specific architectural details such as topology, number of processors, etc. This module supports the concepts of virtual processor and virtual network. The user must take this into account, since the correct execution of a program must not depend upon the topology of the interconnection network.

Usually, the parallel programming style for most of these systems corresponds to the SPMD model [10], in which each processor asynchronously executes the same program but operates on distinct data items. PEPE uses the SPMD model and an extension of Pascal for coding parallel algorithms. This parallel language [7], which we have developed, is based on standard Pascal with some extensions that allow easy and elegant programming of parallel algorithms consisting of processes which communicate by means of message passing. Finally, PEPE uses static scheduling: all processes must be created before execution starts.

For debugging parallel programs, PEPE's environment offers a parallel debugger. With this debugger the programmer gets a global view of the parallel system. Some of its features are breakpoints, the inspection of program states, displaying the contents of data structures and the state of each process, and the modification of the contents of data structures.
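The parallel language itself is a Pascal extension and is not reproduced here; purely as an illustration of the SPMD message-passing style described above, the following C sketch shows the shape of the process code that PEPE simulates. The send_msg/recv_msg primitives, the worker body and the constant P are our own illustrative stand-ins, not part of PEPE.

    /* Illustrative SPMD sketch (PEPE's real language is a Pascal extension).
     * Every process runs the same program on distinct data: process 0
     * scatters work and gathers results, the others compute a partial sum.
     * send_msg/recv_msg are hypothetical stand-ins for the message-passing
     * primitives; here they only log the traffic they would generate.     */
    #include <stdio.h>

    #define P 4                              /* number of virtual processes */

    static void send_msg(int dst, const void *buf, int bytes)
    {
        (void)buf;
        printf("send %d bytes to process %d\n", bytes, dst);
    }

    static void recv_msg(int src, void *buf, int bytes)
    {
        (void)buf;
        printf("recv %d bytes from process %d\n", bytes, src);
    }

    /* Body executed by every process (SPMD): same code, different role. */
    static void worker(int my_id)
    {
        int  chunk[16] = {0};
        long partial   = 0;

        if (my_id == 0) {
            for (int p = 1; p < P; p++)                 /* independent sends */
                send_msg(p, chunk, (int)sizeof chunk);
            for (int p = 1; p < P; p++)                 /* gather results    */
                recv_msg(p, &partial, (int)sizeof partial);
        } else {
            recv_msg(0, chunk, (int)sizeof chunk);      /* wait for work     */
            for (int i = 0; i < 16; i++)
                partial += chunk[i];
            send_msg(0, &partial, (int)sizeof partial); /* dependent send    */
        }
    }

    int main(void)
    {
        for (int id = 0; id < P; id++)   /* sequential stand-in for the      */
            worker(id);                  /* parallel launch of P processes   */
        return 0;
    }

In terms of the trace described in the next section, the sends issued by process 0 would be recorded as independent messages, while each worker's reply would be recorded as a message that depends on the corresponding receive.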

4 The Communication Trace

At the end of the first phase, we can optionally generate an intermediate code (or trace) to evaluate the network performance for the parallel algorithm. That is, the network performance can be evaluated from the communication pattern obtained from a parallel algorithm. The trace records the set of messages that must be sent through the network. As we will see, this trace is independent of the timing parameters of the network.

The trace contains complete information for each message. Each entry consists of five fields, whose meaning is detailed below. Additionally, each message has a message identifier that is not included in the trace: it is equal to the row number where the message appears in the trace.

a) Source Process. Indicates the process that sent this message.

b) Destination Process. Indicates the process for which the message is destined.

c) Message Length. This field indicates the number of data bytes. It does not include control information.

d) Injection Time. Indicates the instant at which the message was injected into the network. This value is given in clock cycles (the clock frequency is a simulator parameter) and can be absolute or relative, as detailed below.

e) Dependency. This field indicates whether the message depends on the reception of another message. If it is independent, the value of this field is -1; otherwise, its value is the identifier of the message on which it depends.
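As a concrete illustration of this record layout, a single trace entry could be held in a structure like the one below. The C field names and types are our own; the paper fixes only the five fields and their meaning.

    /* One row of the communication trace (illustrative field names).   */
    typedef struct {
        int  src_process;     /* a) process that sends the message          */
        int  dst_process;     /* b) process the message is destined for     */
        int  length;          /* c) number of data bytes, no control info   */
        long injection_time;  /* d) clock cycles, absolute or relative      */
        int  depends_on;      /* e) id of the message it depends on, or -1  */
    } trace_entry;
    /* The message identifier is implicit: it is the row number of the
     * entry within the trace.                                          */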

Next, we explain the last two fields in detail. When a parallel program starts execution, some processes send messages. Upon reception of those messages, some processes perform computations, eventually sending more messages. Thus, there are two types of messages in a parallel algorithm: dependent and independent. A message is independent of any other message if a process can send it without having to wait for the arrival of another message. Conversely, a message m is dependent on another message m' when a process p receives message m', performs some computations and then sends message m. This dependency arises either because message m makes use of the information contained in message m', or because process p cannot reach the instruction that sends message m without first receiving message m', even if m does not use the information in m'. Note that in many cases this dependency cannot be statically resolved by the compiler, and it is necessary to wait until execution time to know the dependencies between messages. Thus, communication traces must be generated during the simulation of the execution of the parallel algorithm.

Figure 3 shows the code for a process in three different cases. In the first case, the message is sent independently of any other message, because the process does not need to receive any other message to execute the send instruction. Case b) shows a dependent message: the process has to receive a message and, after processing it, the send instruction is executed. In case c), the dependency between messages is determined at run time; depending on the value of the variable a, the message to send will be dependent or independent. As the communication trace of the algorithm is generated during the simulated execution, the trace will always be correctly generated for the different input values.

Fig. 3. Types of messages: (a) independent message, (b) dependent message, (c) intermediate case, resolved at run time.

The value of the field that contains the time at which a message is sent can be absolute or relative. An absolute value indicates the moment at which a message is sent; at this time, the network simulator will inject the message into the network. Obviously, for independent messages this field always contains an absolute value. On the other hand, a relative value indicates the time from the arrival of a certain message until the departure of the message that depends on it. Therefore, the network simulator will wait a time equal to this value between the arrival of a message at a node and the injection of the dependent message. For dependent messages, the injection time is always taken as a relative value.

The injection time is obtained by computing the time that the simulated processor needs to execute the instructions before the send instruction (absolute value), or between a pair of dependent receive and send instructions (relative value). Our environment allows us to choose among the execution times of some commercial processors, such as transputers and others. The use of dependent messages allows us to use the same traces to simulate different network parameters.
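A minimal sketch of how a trace-driven injector might interpret these two timing conventions is given below; it assumes the illustrative trace_entry structure from section 4 and a hypothetical arrival_time[] table, filled in by the simulator once a message has been completely received. It is not PEPE's actual code.

    /* Simulated cycle at which the message described by *t enters the
     * network. Independent messages carry an absolute time; dependent
     * messages carry a delay relative to the arrival of the message
     * they depend on (arrival_time[] is indexed by message identifier). */
    long injection_cycle(const trace_entry *t, const long *arrival_time)
    {
        if (t->depends_on < 0)               /* independent: absolute cycle */
            return t->injection_time;

        /* dependent: wait for m', then add the relative processing time */
        return arrival_time[t->depends_on] + t->injection_time;
    }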

5 The Network Simulator

Next, we describe the second phase of our environment, the interconnection network simulator. The unique feature of our simulator is that it permits the network to be dynamically reconfigured. A reconfigurable network presents some advantages, the most interesting one being that it can easily match the network topology to the communication requirements of a given program, properly exploiting the locality in communications; moreover, programming a parallel application becomes more independent of the target architecture, because the interconnection network adapts to the application dynamically. In our simulator we can evaluate the performance of the interconnection network for parallel applications and not only for synthetic workloads. This allows us to vary the parameters of the reconfigurable network and study how to improve its behaviour in real cases.

This phase of the simulator consists of two modules, the mapping module and the network simulator module. We now detail the features of each of them.

Fig. 4. Intermediate code after the mapping. Each row of the processor-oriented trace carries the source node, destination node, message length, injection time and dependency (for example: source node 0, destination node 18, message length 16, injection time 0, dependency -1).

The mapping module is the first module of the second phase and is responsible for translating the process-oriented communication trace into a processor-oriented intermediate code according to some pre-defined mapping functions. The intermediate code output by this module is slightly different from the communication trace: the first two fields now refer to processors (source and destination processor) instead of processes. The remaining fields are unchanged. Figure 4 shows an example of this intermediate code. With this module, we can evaluate several mapping algorithms for a given parallel algorithm.
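The translation performed by the mapping module amounts to applying a process-to-processor function to the first two fields of every trace entry. The sketch below assumes the illustrative trace_entry structure from section 4 and an arbitrary user-supplied placement table; it is only meant to show the shape of the operation.

    /* Rewrite a process-oriented trace into a processor-oriented one.
     * map[p] is the node on which process p has been placed; length,
     * injection time and dependency are left untouched, as in figure 4. */
    void apply_mapping(trace_entry *trace, int n_entries, const int *map)
    {
        for (int i = 0; i < n_entries; i++) {
            trace[i].src_process = map[trace[i].src_process]; /* source node      */
            trace[i].dst_process = map[trace[i].dst_process]; /* destination node */
        }
    }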

The last module is the network simulator proper. It is an improved version of a previous simulator [6] that supports network reconfiguration. It can simulate different topologies at the flit level, with network sizes of up to 16K nodes. The topology of the network is definable by the user. Each node consists of a processor, its local memory, a router, a crossbar and several channels. Message reception is buffered, allowing the storage of messages independently of the processes that must receive them. The simulator takes memory contention into account, limiting the number of messages that can be sent or received simultaneously. Also, messages crossing a node do not consume any memory bandwidth.

Wormhole routing is used. In wormhole routing [5] a message is decomposed into small data units (called flits). The header flit governs the route. As the header advances along the specified route, the remaining flits follow in a pipelined fashion. The pipelined nature of wormhole routing makes the message latency largely insensitive to the distance in the message-passing network.

The crossbar allows multiple messages to traverse a node simultaneously without interference. It is configured by the router each time a successful routing decision is made. It takes one clock cycle to transfer a flit from an input queue to an output queue. Physical channels have a bandwidth of one flit per clock cycle and can be split into up to four virtual channels. Each virtual channel has queues of equal size at both ends, and the total queue size associated with each physical channel is held constant. Virtual channels are assigned the physical channel cyclically, and only if they can transfer a flit, so the channel bandwidth is shared among the virtual channels requesting it. It should be noted that blocked messages and messages waiting for the router do not consume any channel bandwidth.

The most important performance measures obtained with our environment are delay, latency and throughput. Delay is the additional latency required to transfer a message with respect to an idle network; it is measured in clock cycles. The message latency lasts from the moment the message is injected into the network until the last flit is received at the destination node. An idle network is a network without message traffic and, thus, without channel multiplexing. Throughput is usually defined as the maximum amount of information delivered per time unit.
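These verbal definitions can be restated compactly; the summary below is our own sketch, with base_latency standing for the latency the same message would experience in an idle, unmultiplexed network (all times in clock cycles, throughput in flits per cycle).

    /* Latency: from injection of the message until its last flit arrives. */
    long message_latency(long inject_cycle, long last_flit_cycle)
    {
        return last_flit_cycle - inject_cycle;
    }

    /* Delay: extra latency with respect to an idle (contention-free) network. */
    long message_delay(long latency, long base_latency)
    {
        return latency - base_latency;
    }

    /* Throughput: information delivered per unit of time (flits per cycle). */
    double network_throughput(long flits_delivered, long elapsed_cycles)
    {
        return (double)flits_delivered / (double)elapsed_cycles;
    }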

The network reconfiguration is transparent to the user; it is handled by several reconfiguration algorithms that are executed as part of the run-time kernel of each node. This class of reconfiguration is not restricted to a particular application, and it is very well suited to parallel applications whose communication pattern varies over time. The goal of the network reconfiguration is to reduce the congestion of the network. For this, when the traffic between a pair of nodes is intense, the reconfiguration algorithm tries to put the source node close to the destination node, to reduce the traffic through the network and, therefore, the congestion that may have been produced. The reconfiguration algorithm decides when a change must be carried out by means of a cost function. The network reconfiguration is carried out in a decentralized way, that is, each node is responsible for trying to find its best position in the network depending on the communication pattern. Also, reconfiguration is limited, preserving the original topology. There are several different parameters [8] that can be varied to adjust how the reconfiguration is performed.

With reconfigurable networks, we want to reduce the latency and delay and to increase the throughput. We also want the number of changes to remain small, to keep the reconfiguration cost low. The quality of each reconfiguration is measured by the simulator.

6 Conclusions

Application development via high-performance simulation offers many advantages. Simulators can provide a flexible, cost-effective, interactive debugging environment that combines traditional debugging features with completely non-intrusive data collection.

We have presented an environment for evaluating the performance of multicomputers. Our environment, unlike most earlier work, captures both the communication pattern of a given algorithm and the most important features of the interconnection network, even allowing the dynamic reconfiguration of the network. This feature is a valid alternative for solving the communication bottleneck problem. By using a reconfigurable topology, the principle of locality in communications is exploited, leading to an improvement in network latency and throughput.

Until now, the study of these features was difficult because there were no tools that permitted varying the different parameters of reconfigurable networks. In this paper, a system that solves this problem and opens a way for studying reconfigurable networks has been presented. A reconfigurable network yields a variety of possible topologies for the network and enables the program to exploit this topological variety in order to speed up the computation [3].

References

1. Adamo, J.M., Trejo, L.: Programming environment for phase-reconfigurable parallel programming on Supernode. Journal of Parallel and Distributed Computing 23 (1994) 278-292
2. Brewer, E.A., Dellarocas, C.N., Colbrook, A., Weihl, W.E.: Proteus: A high-performance parallel-architecture simulator. In: Proc. 1992 ACM SIGMETRICS and Performance '92 Conference (1992) 247-248
3. Ben-Asher, Y., Peleg, D., Ramaswami, R., Schuster, A.: The power of reconfiguration. Journal of Parallel and Distributed Computing 13 (1991) 139-153
4. Davis, H., Goldschmidt, S.R., Hennessy, J.: Multiprocessor simulation and tracing using Tango. In: Proc. of the Int. Conf. on Parallel Processing (1991) II-99 - II-107
5. Dally, W.J., Seitz, C.L.: Deadlock-free message routing in multiprocessor interconnection networks. IEEE Transactions on Computers C-36, No. 5 (1987) 547-553
6. Duato, J.: A new theory of deadlock-free adaptive routing in wormhole networks. IEEE Transactions on Parallel and Distributed Systems 4, No. 11 (1993) 1-12
7. Garcia, J.M.: A new language for multicomputer programming. SIGPLAN Notices 6 (1992) 47-53
8. Garcia, J.M., Duato, J.: Dynamic reconfiguration of multicomputer networks: Limitations and tradeoffs. In: Euromicro Workshop on Parallel and Distributed Processing, IEEE Computer Society Press (1993) 317-323
9. Hartson, H.R., Hix, D.: Human-computer interface development: Concepts and systems. ACM Computing Surveys 21, No. 1 (1989)
10. Karp, A.: Programming for parallelism. IEEE Computer (1987) 43-57
11. Nichols, K.M., Edmark, J.T.: Modeling multicomputer systems with PARET. IEEE Computer (1988) 39-48
12. Segall, Z., Rudolph, L.: PIE: A programming and instrumentation environment for parallel processing. IEEE Software (1985) 22-37

This article was processed using the LaTeX macro package with LLNCS style.