Eden Trace Viewer: A Tool to Visualize Parallel Functional Program Executions

Size: px

Start display at page:

Download "Eden Trace Viewer: A Tool to Visualize Parallel Functional Program Executions"

Randall Lee
6 years ago
Views:

1 Eden Trace Viewer: A Tool to Visualize Parallel Functional Program Executions Pablo Roldán Gómez roldan@informatik.uni-marburg.de Universidad Complutense de Madrid Facultad de Informática E Madrid, Spain Jost Berthold berthold@informatik.uni-marburg.de Philipps-Universität Marburg, Fachbereich Mathematik und Informatik D Marburg, Germany PRELIMINARY VERSION submitted to PACT 2005 conference Abstract This paper presents the Eden Trace Viewer (EdenTV), a tool which generates graphic representations of parallel program executions. The tracing tool was conceived to analyze the behavior of programs written in the parallel functional language Eden. With the high abstraction offered by parallel functional languages, tools for the analysis of the runtime behaviour are all the more important. We describe the concepts of Trace Generation, Trace Interpretation and Trace Representation in the EdenTV, giving brief technical details about its implementation. The functionality of the EdenTV is illustrated by two examples, emphasizing the utility of the EdenTV to analyze and improve the performance of parallel programs. An area of future work is to improve and optimize the tool in future versions, as well as to adapt it for other parallel functional languages. I. INTRODUCTION Due to the high complexity and nondeterministic interaction in parallel systems, optimizing parallel programs is a very complex process. Parallel functional languages eliminate many of the most unpleasant burdens of parallel programming[... ] (foreword of [5]) because they remain abstract about the operational properties of a program. Algorithmic skeletons([2], [16]) capture typical patterns of parallelism, which can often be expressed in the language itself (offering concepts like higher-order functions or demand-driven evaluation, which makes infinite data structures possible), but parallel functional languages [leave] a large gap for the implementation to bridge [5] In the area of parallel programming, and especially optimization, the high abstraction offered by functional languages makes parallel programming much less error-prone, but on the other hand, it is very difficult to know or guess what really happens inside the parallel machine during a program execution. The programmer has few possibilities to analyze and optimize the behavior of a program by only knowing the runtime or speedup achieved. In order to improve the performance of parallel functional programs productively, we need more information about the execution. The common way to analyze the performance of a program is to use profiling tools. Specific information about the behavior of a program is collected during the execution and often written in files for a subsequent study. After the execution, we have a file with data about which events have occurred, when and where. That is what we call trace generation. Afterwards, the collected information is statistically processed by an analysis software (trace interpretation) and the result can be (graphically or textually) presented to the programmer (trace representation). Toolkits exist to support all steps of this procedure, but for high-level languages, the profiling tools often need to implement specific language concepts not present in standard profiling and monitoring tools. We opted to use parts of the well established Pablo Toolkit [17] and its Self-Describing Data Format (SDDF) [1] for trace generation and only customize the subsequent steps in a special software. Trace files in a specialized SDDF are generated by an instrumented runtime system which uses the Pablo Trace Library. The generated files are then loaded into the Eden Trace Viewer (EdenTV), our interpretation and visualization tool programmed in Java which produces interactive bar diagrams for all computational units of the parallel functional language Eden. SDDF contains portable, semantically enriched data (following the same concept as XML), EdenTV can be used on any OS and hardware which supports Java. To understand the target language for this tool and its concepts, Section II gives a short description of Eden, with the basics about language concepts and runtime system (RTS). The next section of the paper (III) explains the components of our profiling system in the three steps mentioned above: First, we look at the trace generation and describe the data which is stored during the execution of a program. Then we describe the trace interpretation step, and how trace data is stored by the EdenTV in order to build the graphics. The last part of the section refers to the representation of the stored data and talks about the style of the EdenTV graphics. Section IV describes features of the EdenTV and shows

2 two example analyses. Section V points to related work, and the last section (VI) concludes and discusses extensions and improvements of EdenTV as future work. II. EDEN LANGUAGE AND IMPLEMENTATION CONCEPTS The target language of the tool is Eden [11]. This language extends the purely functional language Haskell[13] with syntactic constructs for explicitly defining and creating parallel processes in machines of a closed parallel system. The programmer has direct control over process granularity, data distribution and communication topology [8], but does not have to manage synchronization and data exchange between processes, which is done by the runtime system through implicit communication channels, transparent to the programmer. A. Coordination Constructs The essential two coordination constructs of Eden are process abstraction and instantiation. Functions can be embedded into process abstractions and a process instantiation operator starts a process by supplying input to the abstraction. process :: (Trans a, Trans b) => (a -> b) -> Process a b embeds functions of type a->b into process abstractions of type Process a b where the context (Trans a, Trans b) states that both a and b are overloaded values belonging to the Trans class of transmissible values. A process abstraction process (\x -> e) defines the behavior of a process with parameter x as input and expression e as output. A process instantiation uses the predefined infix operator ( # ) :: (Trans a, Trans b) => Process a b -> a -> b to provide a process abstraction with actual input parameters. Processes are distinguished from functions by their operational property to be executed remotely, while their denotational meaning remains unchanged as compared to the underlying function. Haskell, the computation language of Eden, uses lazy evaluation i.e. data is only evaluated on demand, if it is needed for the overall result. Eden deviates from this principle: Eden processes are eagerly instantiated and evaluate and send their output eagerly. This aims at increasing the parallelism degree and at speeding up the distribution of the computation. Local garbage collection detects unnecessary results and stops the evaluating remote threads. B. Process Communication In general, Eden processes do not share data among each other and are encapsulated units of computation. All data is communicated eagerly via (internal) channels, avoiding the need for global memory management and data request messages, but possibly duplicating data. Evaluation of the expression (process funct) # arg leads to the creation of a new process for evaluating the application of the function funct to the argument arg. The argument is evaluated by new concurrent threads in the parent process and sent to the new child process, which, in turn, fully evaluates and sends back the result of the function application. Both are using implicit 1:1 communication channels established between child and parent process on process instantiation. Communication is encoded in the type class Trans (Transmissible) shown in the type signatures above. As a matter of principle, every transmitted value is evaluated to normal form (NF) prior to transmission. Furthermore, the class uses overloaded communication functions for lists, which are transmitted as streams, element by element, and for tuples, which are evaluated component-wise by concurrent threads in the same process. An Eden process can thus contain a variable number of threads during its lifetime, supplying input to child processes or concurrently evaluating results. As communication channels are normally connections between parent and child process, the communication topologies are hierarchical. In order to support other topologies, Eden provides additional language constructs to create channels dynamically, which is another source of concurrent threads in the sender process. C. Runtime System Eden is implemented on the basis of the Glasgow Haskell Compiler GHC [12], a mature and efficient Haskell implementation. While the compiler frontend is almost unchanged, Eden uses a modified runtime system (RTS), where kernel parts are shared with GUM, the implementation of Glasgow Parallel Haskell (GpH) [20]. The parallel RTS uses suitable middleware (currently PVM[15]) to manage parallel execution and communication. Besides, the implementation of Eden relies essentially on the GHC implementation of concurrency. Concurrent threads, modelled in GHC by Thread State Objects (TSOs), are the basic unit for the implementation, so the central task for profiling is to keep track of their execution. TSOs in GHC are scheduled Round-Robin, depending on the memory consumption, which leads to the possible state transitions shown in Figure 1. new thread kill thread Fig. 1. Runnable Finished run thread suspend thread Running kill thread kill thread deblock thread block thread Thread state transition diagram in Eden Blocked A thread can be in one of the following states: runnable, running, blocked and finished. The first state of a thread after its creation is runnable, i.e. it can be executed, but the RTS has not chosen it yet. When the RTS schedules it for execution, the thread goes to state running, and possibly back to state runnable when descheduled. If a running thread needs a value which is not available, it blocks on a placeholder node in the graph heap. When the missing value arrives, the thread goes back to state runnable again. Finally, a thread finishes after

3 evaluating and sending its output, or when its output is not required any more (detected by a local or remote garbage collection). The channels between processes are modelled by inports and outports in the RTS, both administered by respective tables to ensure the 1:1 condition. A channel is thus a connection from an outport to exactly one inport. An Eden process, as a purely conceptual unit, consists of a number of concurrent threads which share a common graph heap (as opposed to processes, which only communicate via channels). A process table keeps track of the thread count in a process and its number of open ports. The Eden RTS does not support the migration of threads or processes to other machines during execution, so every thread is located on exactly one machine during its lifetime. The first task for profiling is thus to keep track of the thread state transitions within a process, and of communication between processes. III. IMPLEMENTATION OF THE EDENTV As explained in Section I, the typical steps of profiling usually are trace generation, trace interpretation and trace representation. The EdenTV separates these steps as much as possible, so that single parts can easily be maintained and modified for other purposes. The overall aim of EdenTV is a post-mortem analysis, thus the steps are also temporarily separated. A. First Step: Trace Generation To profile the execution of an Eden program, we must collect information about the behavior of machines, processes, threads and messages by writing selected events into a trace file. Usually, a trace event is considered as an instance of the execution of a specific statement or instruction in an application, on a specific machine. In our case, an event indicates the creation or a state transition of a unit of computation, machine, process or thread. Table 2 contains the events selected for the instrumentation, which have self-explanatory names. Fig. 2. Start Machine New Process New Thread Run Thread Block Thread Send Message End Machine Kill Process Kill Thread Suspend Thread Deblock Thread Receive Message Events to capture during the execution of an Eden program As already mentioned in Section I, we use the Pablo Toolkit [17] and its Self-Defining Data Format (SDDF) [1] developed at the University of Illinois. The Pablo Trace Capture Library provides an API of tracing methods for Fortran and C that can be called from the target code to generate trace files. The Eden RTS is instrumented with these calls to give us the necessary information about the execution. This is important, because it means that we do not have to change the Eden program itself # 201: "Block Thread" { int "timestamp"[] double "Seconds" int "Event ID" int "Machine Number" int "Process ID" int "Thread ID" int "Inport ID" int "Block Reason" };; Fig. 3. #217: "Send Message" { int "timestamp"[] double "Seconds" int "Event ID" int "Machine Number" int "Sending Process ID" int "Outport ID" int "Receiving Machine"; int "Receiving Process" int "Inport ID" int "Tag of the message" };; Examples of SDDF data record structures to obtain the traces. The instrumentation is just a matter of placing library functions properly in the RTS source code. The SDDF is a flexible file meta-format that specifies both data record structures and data record instances [1]. This meta-format is not based on predefined layouts and allows to add new event types easily. Thanks to this we can define our own event family to meet the requirements of our instrumentation. SDDF events are specified by data record structures, which are written at the beginning of each SDDF file. Data record instances in the file contain the trace event information itself, and identifiers linking them to their corresponding data record structures. This idea is the same that underlies the XML format, but in the case of the SDDF both definition structures and data instances can be found in the same file. By capturing an event we obtain specific data depending on its type. This data is saved in the trace file by data record instances according the Pablo SDDF, in ASCII representation (two examples of SDDF data record descriptors are in Figure 3). Each event is characterized by an integer EventID appearing at the beginning of each SDDF data record instance. In order to localize the events, SDDF record instances contain a unique tuple with the machine, process and thread numbers in which the event has occurred. A field called Seconds determines the time of the execution when the corresponding event occurred. Not all events require the same values to be saved in the SDDF records. For example, for the event Send Message we should save integers representing the sending and the receiving processes respectively. For the event Block Thread (Figure 3), we need different information. Thus, according to the nature of each captured event, a different type of SDDF record instance is written into the trace file. The instrumentation of the RTS has been carried out mostly in a separate module, which contains the overloaded trace functions and initializations. To cover all the events, only four existing source code files of the RTS had to be modified. The instrumented Eden RTS is implemented as an additional compiler way 1. The compiler itself (code generation) is not modified at all. In order to use the profiling version, the program only has to be linked to a different RTS. The Pablo 1 Ways are a concept of GHC which allows to select between different compiler and RTS versions when compiling a program.

4 Trace Library is efficiently implemented. Measurements have shown that the performance of Eden programs is hardly affected at all by writing the event files. B. Second Step: Trace Interpretation To read and interpret the trace files produced by the instrumentation of the Eden RTS, we model the logical units of the execution by the object hierarchy shown in Figure 4. EdenObject <<contains>> TraceData EdenMachine EdenProcess EdenThread EdenMessage <<contains>> Fig. 4. <<contains>> <<contains>> Object hierarchy of the EdenTV For the EdenTV, logical units are machines, processes, threads and messages, abstracted by the class EdenObject. The four concrete subclasses derived from the class EdenObject are EdenMachine, EdenProcess, EdenThread and EdenMessage. Each class defines fields to identify the objects and saves information that rebuilds the activity and behavior of each logical unit. The most important class for the reconstruction of the execution is the class TraceData. An instance of this class represents the whole execution of an Eden program, containing arrays of machines, processes, threads and messages. The complete information collected from one trace file is accessible in an object of this class. The TraceData uses a parser class to obtain the data from the trace file and fills the mentioned arrays from it. We model a class for each event of the instrumentation as well. Since each event needs different fields to be saved, new subclasses of the event class were defined for each different event of Figure 2. To explain it shortly, the constructor of the class Trace- Data creates, updates and arranges the objects of the class EdenObject (machines, processes, threads and messages) from the event objects that the parser class produces while reading the trace file. The creation of an object of the class TraceData from a trace file is the trace interpretation part of the EdenTV. The following and last step is to represent the information collected in the TraceData instance graphically. C. Third Step: Trace Representation In the space-time diagrams generated by the EdenTV, machines, processes and threads are represented by horizontal bars, where the horizontal axis of the diagram is the execution time. Messages between processes or machines are optionally represented by grey arrows which start from the sending unit bar and point at the receiving unit bar. The representation of messages is very important for the programmer, since he/she can observe hot spots and inefficiencies in the communication during the execution. TABLE I COLOR CODES FOR BAR GRAPHICS IN EDENTV Machine Process Thread Blue Idle Idle n/a (no processes) (no threads) Yellow System Time Runnable Runnable (threads runnable) (one thread) (this thread) Green Running Running Running (one process) (one thread) (this thread) Red Blocked Blocked Blocked (all processes) (all threads) (this thread) EdenTV produces six diagrams: All Machines, All Processes, All Threads, Processes In Machines, Threads In Processes and Threads In Machines. The diagram Process In Machines is similar to the diagram All Processes, but the processes are arranged by machines and not by process ID. The same holds for the diagrams Threads In Processes and Threads In Machines, which are like the diagram All Threads, but arranged by processes and machines respectively. For this reason, this section only gives a description for the first three diagrams. The diagram bars have segments in different colors, which indicate the activities of the respective logical unit in a period during the execution. These color codes are shown in Table I. The diagram All Machines (Figure 5) shows the activity of all machines that have taken part in the execution. A machine is displayed as idle if it has no process to execute. It is in state system if at least one thread in it is runnable, but no thread is running (garbage collection, receiving messages, other user s processes). The machine is in state running if one thread is running in it. When all threads in the machine are blocked, it is in state blocked. A similar diagram All Processes (Figure 6(a)) shows the activity of all processes. The same four colors are used to differentiate the states of the processes. The conditions for the states system, running and blocked are exactly the same for a process as for a machine. A process is in state idle if the process does not contain any threads 2. In general, the following equation holds for the thread count inside one process or machine: 0 Runnable Threads + Blocked Threads Total Threads Color codes for processes and machines are assigned following the comparison of these numbers. In general, the first condition checked by the EdenTV is if the machine or process is in state idle (Total Processes resp. Total Threads= 0). An inequality on the right implies that a thread is running. Otherwise the state is either runnable or blocked, depending on the number of runnable threads (the state is runnable if Runnable Threads > 0). This context information for all units is the basis of the graphical representations. For the diagram All Threads (Figure 6(b)) the colorcodification is similar, but the colors of the bar fragments 2 An idle process is normally inexistent, but exists in the implementation, so it was modelled in the representation as well.

Fig. 5. All Machines Graphic (a) All Processes Graphic (b) All Threads Graphic Fig. 6. Diagrams of EdenTV correspond to the state of the thread itself.

The instances of this class contain a TraceData instance which is processed to generate a diagram.

Functionality and Mode of Use When a trace file generated by the instrumented RTS is opened, the gathered information is presented in the bar diagrams shown in

5 Fig. 5. All Machines Graphic (a) All Processes Graphic (b) All Threads Graphic Fig. 6. Diagrams of EdenTV correspond to the state of the thread itself. Note that a thread can not be in state idle. To implement the diagram construction, a class GraphicEdenPanel was defined. The instances of this class contain a TraceData instance which is processed to generate a diagram. Since each diagram is generated independently, six subclasses of GraphicEdenPanel were created for the six different diagrams. IV. USING EDENTV A. Functionality and Mode of Use When a trace file generated by the instrumented RTS is opened, the gathered information is presented in the bar diagrams shown in the previous section as well as in textual form. The text field on the right gives detailed statistics about the overall computation and details for each unit (machine, process or thread) shown in the graphic representation. The bar diagrams presented by EdenTV can be zoomed in

6 Fig. 7. Zoomed View with Messages and Interactive mode. order to get a closer view on the activities at critical points during the execution. Furthermore, the user can select to see the messages sent and received by each machine or process during the execution of an Eden program. Messages are not shown by default because showing the messages often hides other details in the diagrams, and due to resource consumption in the viewer when handling bigger trace files. Additionally, an interactive mode provides extra information when a bar in the diagram is clicked. This viewer mode is resource-consuming as well and therefore turned off by default. It provides the information for the event immediately preceding the point in time which has been clicked, i.e. the cause of the current state of the clicked unit. The described options are situated on the lower right of the panel below the text field. Some other options only control the appearance, such as the separation between the displayed bars and the bar style. A more important feature is the ability to leave out units from the display if their lifetime was less than an adjustable minimum (Minimum Time). B. Examples: Tuning Eden Programs with EdenTV 1) Warshall: Lazy Evaluation vs. Parallelism: Using a lazy functional computation language, a crucial issue for Eden as well as other Haskell-based parallelism is to start the evaluation of needed subexpressions early enough and to fully evaluate them for later use. While this has lead to Eden s eager communication in the language design described earlier, evaluation strategies [19] are another well-established means of starting parallel evaluation. However, the basic choice to either evaluate a final result to WHNF or completely (NF) sometimes does not offer enough control to optimize an algorithm strategies must then be applied to certain subresults. On the sparse basis of runtime measurements, such an optimization would be rather cumbersome. The EdenTV, accompanied with code inspection, makes such inefficiencies obvious, as in the following example. Example: The program measured here is a parallel implementation of Warshall s algorithm to compute shortest paths for all nodes of a graph from the adjacency matrix (adapted from [14]). The minimum distances from one node to every other are continuously updated and successively passed through a ring structure linking all processes, starting with distances from the first node. Implementing this algorithm is very simple, provided the existence of a suitable ring skeleton. The task reduces to defining the function each ring process should apply to its input from the parent and from the ring neighbour. The trace visualizations shown in Fig. 8 show the All machines view of the EdenTV for two different warshall programs on 16 processors of a Beowulf cluster, the input graph consisting of 500 nodes. The first straight-forward version of the program shows bad performance, due to the inherent data dependence in the algorithm. Using the trace viewer makes obvious that the first phase of the algorithm runs on several machines, but completely sequentially. Only the second phase of the algorithm really runs in parallel on all machines. The length of this second phase depends on the position in the ring: the first ring process (which started working first) must do the most work in the end, thereby dominating runtime. Viewing the communication, one can see that the nodes communicate as intended, but do not process the received data. The reason for this bad performance is demand-driven evaluation: Data received from a ring neighbour is passed

7 Runtimes:154.2 sec. (above) and 40.7 sec.(below) Fig. 8. Warshall-Algorithm, first (above) and optimized version (Beowulf Cluster, Heriot Watt University, Edinburgh, 16 machines) to the next node unchanged, until the node sends its own rows, which are updated with all ring data received before. Only when the node s updated own rows are passed into the ring, their evaluation begins. Before this inherent demand, the processes only accumulate data from the ring communication, but do not proceed updating their rows with the distances received from the ring neighbour. Inserting a normal form evaluation strategy (rnf) for an intermediate result into the respective node function dramatically improves runtime. We still see the impact of the data dependency, leading to a short wait phase passing through the ring, but the optimized version shows good speedup and load balance. A crucial issue for such optimizations is however to identify the displayed units from their behaviour, without a link to the program s source code. 2) Mandelbrot: Irregularity and Cost of Load Balancing: Parallel skeletons (higher-order functions implementing common patterns of parallel computation [2], [16], [8]) provide an easy parallelization of existing programs by replacing common higher-order functions by parallel equivalents. A prime example is the well-known higher-order function map, which applies a function to all elements of a list, and can be parallelized in different ways. In the case of map, problems may arise when the tasks to solve expose irregular complexity. Irregularity can lead to uneven load distribution when tasks are statically distributed, and the machine with the most complex tasks will dominate the runtime of a parallel computation. By using a feedback from process results to process inputs, work can also be distributed dynamically in a parallel map implementation, thereby adapting to the current load and speed of the different machines. But the dynamic distribution necessarily has more overhead than a static one. Example: We compare these different work distribution schemes using a program which generates Mandelbrot Set visualizations by applying a uniform, but potentially irregular computation to a set of coordinates. The underlying computation scheme, map, is parallelized by two skeletons presented in [8]. 1) Farm: the tasks are statically distributed to a set of workers, one for each available machine. Each worker solves a fixed set of tasks. 2) Workpool: the main machine distributes only few initial tasks among the others (worker machines). Each time a worker has finished one task, the main machine sends a new one, i.e. subsequent tasks are distributed dynamically on request. From previous results in older measurements ([8]), we could expect that the dynamic distribution in the workpool skeleton

distribution, since the tasks have highly irregular complexity.

8 a.- Farm Skeleton (Runtime: 8.6 sec.) b.- Workpool Skeleton (Runtime: 11.5 sec.) Fig. 9. All Processes diagrams for the Mandelbrot program (Beowulf Cluster, Heriot Watt University, Edinburgh, 16 machines) performs better than a static distribution, since the tasks have highly irregular complexity. On the other hand, the dynamic load balancing requires additional, time-critical communication and prevents some optimization in the communication system. The program was executed on 16 machines of a Beowulf- Cluster, with a problem size of pixels. As we can see in the diagrams shown in Fig. 9, the workpool-skeleton

9 does not perform as good as expected. Runtime is longer for the workpool skeleton, where EdenTV reveals that the master node is a serious bottleneck, which receives requests from all other 15 machines and thus spends much time receiving messages (system time). One of the worker machines seems to work slower than the others, most likely due to other tasks on the respective machine. The other machines have spent a lot of time in blocked state, waiting for new work assigned by the heavily loaded master node. We see that all machines stop at the same time, i.e. the system is well synchronized, but has a severe bottleneck. The trace of the farm skeleton shows an obvious load imbalance among the workers. The main machine (bottom bar) merges all results at the end of the computation, leading to a longer sequential end phase. Both versions are far below the theoretical optimum of using all processors in parallel all of the time. The analysis using the EdenTV has revealed why the Farm skeleton has better performance than the Workpool skeleton for this program and the problem size investigated here. By increasing the amount of tasks a worker receives initially, we gain only slightly better performance. Increasing granularity by combining several tasks would lead to less tasks in total and contradicts the intended load balancing properties. A better way is to parallelize the work distribution as well, and to use more than one main machine in a hierarchy of workpools. V. RELATED WORK The EdenTV conceptually follows a rather general concept of tracing tools: trace information is gathered during execution, saved in files, and afterwards processed by a tool which is aware of specific language concepts and the relationship between units of computation. A rather simple (yet insufficient) way to obtain information about the program s parallelism would be to trace the behaviour of the communication subsystem. The Eden RTS currently uses PVM, which comes with a built-in trace capability (using Pablo SDDF as well). Tracing PVM-specific and user-defined actions would be possible and visualisation could be carried out by XPVM [9]. But the weak points of such a solution off-the-shelf are obvious: PVM-tracing yields only information about spawned subtasks and concrete messages between the machines. Internal buffering and threads in the RTS would remain invisible, unless user-defined events are used. So we opted not to bind us to PVM tracing. Switching to a different message passing system would otherwise be very hard, although one can expect every suitable system to come with comparable features. Comparable to the tracing included in XPVM, many efforts have been made in the past to create standard tools for trace analysis and representation (e.g. Pablo Analysis GUI [17] or the ParaGraph Suite [6], to mention only two essential results). These tools have interesting features EdenTV does not yet include, like stream-based online trace analysis, execution replay, and a wide range of standard diagrams. But the aim of EdenTV was a specific visualisation of logical Eden units and needed a more customized solution. The more closely related tracing concepts for parallel functional languages have collectively influenced the EdenTV in tracing and representation. Using the trace format and tools of the Granularity Simulator GranSim [10], one can not only simulate programs in the parallel functional language GpH, but also obtain information about real parallel program executions. Due to the different language concept, GranSim does not trace any kind of communication and offers rather bird s eye views on the execution than to concentrate on the single computation units. The EdenTV diagrams are inspired by the per-processor view of GranSim, commonly known as Gantt diagrams, or spacetime-diagrams when combined with message display [3]. The Eden derivative of GranSim, Paradise [7], was a pure simulator, but it offers an interesting feature not yet present in EdenTV: respecting the explicit coordination concept of Eden, Paradise offers to label instantiated processes in the visualisation, thereby linking the trace information to the program source code. Last but not least, the direct predecessor of EdenTV was the work by Ralf Freitag [4], which was entirely based on the Pablo Analysis GUI (not available any more). VI. CONCLUSION AND FUTURE WORK The Eden Trace Viewer is a combination of an instrumented runtime system to generate, and an interactive GUI to interpret trace information and to represent it in interactive diagrams. These three steps are common to many profiling solutions, separated as much as possible in the EdenTV implementation. Its usefulness for the analysis and optimization of parallel program performance is due to a view of concrete parallel executions. All important aspects of parallel execution are shown in the diagrams, which can be explored in great detail. Specific language concepts of the parallel functional language Eden are present in all parts of the tool, but the described concepts of design and implementation can be applied to target other parallel languages as well. The code of the EdenTV is written in Java and can be easily extended to support a larger set of events and to generate new diagram types. All languages using logical units like processes, threads and messages in a way similar to Eden can be analyzed by such a tool. With minor extensions to the instrumentation carried out for Eden, trace files for GpH programs could be generated in SDDF by the shared RTS. As the instrumentation is cleanly separated from other parts and uses a well-established library, only few work should be necessary for this task. Once the trace files contain new event types, appropriate new objects can be introduced into the code of the EdenTV to parse, interpret and represent these new files. Other logical units than Eden units can be represented in new diagrams. [18] explains in detail where such modifications would be located, as well as reasonable optimizations, which could not be integrated in the code due to time limitations. As another

10 extension, the EdenTV could show an analysis while the program still runs, using stream-based trace file processing. Acknoledgements The authors thank Rita Loogen for her helpful comments improving this paper, and the colleagues from Heriot-Watt- University, Edinburgh for the opportunity to work with their beowulf cluster. REFERENCES [1] R. A. Aydt. The Pablo Self-Defining Data Format. Technical report, University of Illinois. Department of Computer Science, http: //www-pablo.cs.uiuc.edu. [2] M. I. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. Research Monographs in Parallel and Distributed Computing. The MIT Press, Cambridge, MA, [3] I. Foster. Designing and Building Parallel Programs. Addison-Wesley, [4] R. Freitag. Entwicklung eines Werkzeugs zur Analyse des Laufzeitverhaltens paralleler funktionaler Programme. Master s thesis, Philipps- Universität Marburg, Germany, In German. [5] K. Hammond and G. Michaelson, editors. Research Directions in Parallel Functional Programming. Springer-Verlag, [6] M. T. Heath and J. A.. Etheridge. Visualizing the performance of parallel programs. IEEE Software, 8(5), [7] F. Hernández, R. Peña, and F. Rubio. From GranSim to Paradise. In Trends in Functional Programming 1 (SFP 99). Intellect, [8] U. Klusik, R. Loogen, S. Priebe, and F. Rubio. Implementation Skeletons in Eden Low-Effort Parallel Programming. In Implementation of Functional Languages: IFL 00, volume 2011 of LNCS, Aachen, Germany, Springer. [9] J. A. Kohl and G. A. Geist. The PVM 3.4 tracing facility and XPVM 1.1. In Proceedings of HICSS-29, pages IEEE Computer Society Press, [10] H. W. Loidl. GranSim User s Guide. Technical report, University of Glasgow. Department of Computer Science, [11] R. Loogen, Y. Ortega, and R. Peña. Parallel Functional Programming in Eden. Journal of Functional Programming, To appear. [12] S. Peyton Jones, C. Hall, K. Hammond, W. Partain, and P. Wadler. The Glasgow Haskell Compiler: a Technical Overview. In JFIT 93, March [13] S. Peyton Jones and J. Hughes. Haskell 98: A Non-strict, Purely Functional Language, Available at org/. [14] M. Plasmeijer and M. van Eekelen. Functional Programming and Parallel Graph Rewriting. Addison-Wesley, Reading, Massachusetts, USA, [15] Parallel Virtual Machine Reference Manual, Version 3.2. University of Tennessee, August [16] F. A. Rabhi and S. Gorlatch, editors. Patterns and Skeletons for Parallel and Distributed Computing. Springer, [17] D. A. Reed, R. D. Olson, R. A. Aydt, T. Madhyastha, T. Birkett, J. Jensen, B. A. A. Nazief, and B. K. Totty. Scalable performance environments for parallel systems. Technical Report 1673, Dep. of Computer Sc., University of Illinois, Illinois, [18] P. Roldán. Eden Trace Viewer: Ein Werkzeug zur Visualisierung paralleler funktionaler Programme. Master s thesis, Philipps-Universität Marburg, Germany, In German. [19] P. Trinder, K. Hammond, H.-W. Loidl, and S. Peyton Jones. Algorithm + Strategy = Parallelism. J. of Functional Programming, 8(1), January [20] P. W. Trinder, K. Hammond, J. S. M. Jr., A. S. Partridge, and S. L. P. Jones. Gum: A portable parallel implementation of haskell. In PLDI, 1996.

Visualizing Parallel Functional Program Runs: Case Studies with the Eden Trace Viewer

John von Neumann Institute for Computing Visualizing Parallel Functional Program Runs: Case Studies with the Eden Trace Viewer Jost Berthold and Rita Loogen published in Parallel Computing: Architectures,