DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA

M. GAUS, G. R. JOUBERT, O. KAO, S. RIEDEL AND S. STAPEL

Technical University of Clausthal, Department of Computer Science
Julius-Albert-Str. 4, 38678 Clausthal-Zellerfeld, Germany

Distributed platforms are not necessarily well-suited for systems which handle large data sets, such as those processed in multimedia applications. In this paper a specialised computation model, based on asynchronous transmission, is presented. As the necessary functions are encapsulated, this system can be used without detailed knowledge of the system architecture. A dynamic strategy of task execution is utilised to adjust the number and size of the distributed data packages according to the computational load of the processing elements (PEs) at transmission time. Thus more powerful PEs, or those whose resources are not fully utilised, will either receive packages more frequently or will be given larger packages. In large networks some nodes can be replaced by others, or only a few data blocks may be sent to particular nodes. The efficiency of the method is evaluated with a variety of practical run time measurements.

1 Introduction

Distributed systems consisting of a network of workstations are increasingly being used for solving compute intensive problems. Distributed platforms are, however, not always well-suited for systems which handle large data sets, as can be found in, for example, multimedia applications. The limiting factor for the processing of large data sets is usually network bandwidth. Thus, the distribution of huge amounts of data bounds the overall processing speed. This situation is made worse by the fact that data is transmitted only when requested or sent by the parallel processes. In order to reduce this effect, the transmission of data should be separated from the process synchronisation.
Well-known software systems for parallel/distributed processing on existing computer networks are PVM, MPI, PVMPI, Condor, Mosix [1-5] and TreadMarks. An advantage of PVM is its availability on nearly all important architectures and operating systems. On the other hand, its synchronous data transfers and type conversions are time consuming, making it unsuitable for the processing of large multimedia data sets.

1.1 Multimedia data

Multimedia data has become an important component of modern software systems. Static media (images, graphics, text) are combined with dynamic media (audio, video, animations) to obtain realistic representations of natural processes, to visualise complex results or to depict dynamic processes. In spite of the increases in memory sizes, processing and communication speeds, the processing and communication of multimedia data is still time and compute intensive. Some of the initial problems, essentially data compression, were solved by the development of efficient compression algorithms, e.g. JPEG, MPEG, MP3. Many of these algorithms have been implemented in hardware, offering the possibility of real time encoding.

submitted to World Scientific : 11.10.99 : 13:47 1/1

The next step was the parallelisation of numerous procedures for processing multimedia data. Static media, such as those encountered in image processing applications, are usually subdivided into independent data fragments, which are then distributed among a number of processing elements. The partial results are gathered and combined to form the final result. In dynamic media, interdependencies between the different data blocks must be considered and resolved. An example of this is MPEG compression, which is based on finding and eliminating redundant information in consecutive frames.

Parallelisation by means of data segmentation is well-suited for parallel computers with shared memory, since little or no time is spent on communicating the data. Software for distributed computing in heterogeneous networks achieves a smaller performance gain because of the slow synchronous transfers and the greater variance in client resources. If the operations executed are simple, these delay effects can be seen quite clearly. An example of this is the calculation of correlation coefficients for short-term series [8]. Considering all combinations between 100 shares and a time difference of 5 days resulted in 681450 correlation terms and 27 megabytes of data. The performance gain from parallelising the algorithm with PVM among 4 DEC Alphas was negated by the resulting administration overhead. As a result, the program ran up to 6 times faster on a single workstation than the parallel PVM version. These characteristics (large data sets, simple operations) are also found in the management, retrieval and processing of multimedia data.
Current approaches to multimedia databases are based on the extraction and management of specific characteristics. Queries compare the extracted characteristics with those of all images stored in the database, and return the most similar images. Each archival and retrieval process results in the computation of huge amounts of data. Performance gains through parallelisation are negated by transfer times and administration of the data, as described in the correlation example. This makes a specialised model for the parallel processing of huge amounts of data necessary.

2 Processing model for static multimedia data

The proposed processing model aims to make the development of parallel programs easy for inexperienced users, and to minimise the communication and management effort by using TCP/IP sockets directly. Similar to the work pile model [6], this model is based on the creation of pools of tasks, which are controlled by three special processes (distribution manager, collection manager and computation client). The information is divided into sections which are distributed to a number of processing elements (Figure 1).
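The division of the information into sections, and the later recombination of the results, can be sketched as follows. This is only an illustrative sketch and not the implementation described in the paper: the names `Packet`, `split_into_packets` and `merge_packets` are hypothetical, and the default block size of 16384 bytes is taken from the measurements reported in Section 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical data packet: a sequence number plus a fragment of the data."""
    seq: int
    payload: bytes


def split_into_packets(data: bytes, size: int = 16384) -> list:
    """Divide the data into independent fixed-size sections (the last may be shorter)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [Packet(i // size, data[i:i + size]) for i in range(0, len(data), size)]


def merge_packets(packets) -> bytes:
    """Recombine packets into the original byte order, whatever order they arrived in."""
    return b"".join(p.payload for p in sorted(packets, key=lambda p: p.seq))


if __name__ == "__main__":
    original = bytes(range(256)) * 200           # 51200 bytes of sample data
    pool_of_tasks = split_into_packets(original)
    # Results may come back in any order; the sequence number restores it.
    restored = merge_packets(reversed(pool_of_tasks))
    print(len(pool_of_tasks), restored == original)
```

Carrying a sequence number in every packet is one simple way to let the results arrive in arbitrary order, which matters once packets are processed by PEs of different speed.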
[Figure 1: Schematic representation of the processing model. The distribution manager takes data packets from the pool of tasks and sends them to the processing elements PE 1, PE 2, ..., PE n; the collection manager gathers the processed packets into the pool of results.]

2.1 Distribution manager

The distribution manager is responsible for the division and management of the data packets to be processed. Push technology is used to minimise the transfer cost between server and clients. The responsibilities of the distribution manager include the definition of data packets, the management of data packets in the local pool of tasks, the processing of client requests and the distribution of the data packets among the processing elements. The distribution strategy is set within this process. Essential requirements include the efficient use of available resources as well as tolerance of failures.

To circumvent problems related to processing element failures the data packets are subdivided into three groups: the first group consists of packets which have not yet been distributed, the second group comprises transmitted but unprocessed packets, and the third group consists of processed data packets.

A simple distribution strategy for the available data packets increases computing efficiency. If the first group is empty, but unprocessed data blocks are still in the second group, then these are dispatched to idle clients which have already completed their computation tasks. This can be achieved by generating a list of all available active nodes and of the status of their local pools of tasks. The number of distributed but not yet processed packets can be calculated from the number of packets sent, but not yet received by the collection task. This requires a direct connection between the distributor and the collector. The difference is compared to a given threshold value. If it is below the threshold, the distribution manager sends new packets to the client.
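The bookkeeping described above (three packet groups, a per-client threshold, and re-dispatch of unprocessed packets to idle clients) might be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the class `DistributionManager` and its method names are hypothetical, packets are represented only by their identifiers, and the network transfer itself is left out.

```python
from collections import deque


class DistributionManager:
    """Hypothetical bookkeeping for the three packet groups of Section 2.1."""

    def __init__(self, packet_ids, threshold=2):
        self.pending = deque(packet_ids)  # group 1: not yet distributed
        self.in_flight = {}               # group 2: packet id -> clients holding it
        self.done = set()                 # group 3: processed packets
        self.threshold = threshold        # assumed per-client limit

    def outstanding(self, client):
        """Packets sent to this client that the collector has not yet reported."""
        return sum(1 for holders in self.in_flight.values() if client in holders)

    def next_packet(self, client):
        """Return the id of the packet to push to the client, or None."""
        load = self.outstanding(client)
        if load >= self.threshold:
            return None                   # client still has enough work
        if self.pending:
            pid = self.pending.popleft()
        elif load == 0:
            # Group 1 is empty: give an idle client a packet that another
            # client received but has not yet finished.
            candidates = [p for p, h in self.in_flight.items() if client not in h]
            if not candidates:
                return None
            pid = candidates[0]
        else:
            return None
        self.in_flight.setdefault(pid, set()).add(client)
        return pid

    def acknowledge(self, pid):
        """The collector reports that packet pid has been processed."""
        self.in_flight.pop(pid, None)
        self.done.add(pid)

    def client_failed(self, client):
        """Return packets held only by a failed client to the pool of tasks."""
        for pid, holders in list(self.in_flight.items()):
            holders.discard(client)
            if not holders:
                del self.in_flight[pid]
                self.pending.append(pid)


if __name__ == "__main__":
    dm = DistributionManager(range(3), threshold=2)
    print(dm.next_packet("a"), dm.next_packet("a"), dm.next_packet("a"))
```

In this sketch a packet may be held by several clients at once after a re-dispatch; the first acknowledgement moves it to the third group, and later duplicates are simply ignored.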
This strategy also requires a time and/or workload oriented distribution of the data packets, since processing can only occur if the processing element has a low CPU load. A blocked client that does not satisfy this requirement is regarded as a node that has failed. The server will redistribute the data packets sent to this client.

2.2 Computation client

This component performs the computation on each processing element. A simple and compact structure reduces the management overhead and enables a significant performance increase. The computation client consists of a local pool of tasks, a
processing object and a local pool of results. In the local pool of results the processed data packets are temporarily stored until a connection for the transfer to the collection manager becomes available.

2.3 Collection manager

This process accepts processed data packets from the computation clients and stores them in the pool of results until all data packets have been received. Once this occurs, it composes the processed original from the received data packets. In JPEG encoding, for example, a picture or a series of pictures would be composed at this point. Furthermore, the collector sends a message giving the number of received data packets to the distributor. From this information the distribution manager determines the current workload of each client and redefines the distribution strategy. The distributor is also notified when all data packets have reached the collector and the processing is completed.

2.4 Arraying in multiple hierarchical levels

The described model consists of two hierarchical levels, containing the distribution and collection processes on one level, and the computation clients on the other. This model will quickly reach its capacity limits with a large number of non-local processing elements. An alternative is to arrange servers hierarchically. The lower levels of this hierarchy contain not only clients, but subordinate servers as well, which distribute the data packets to lower level clients. An example of the application of such a model is data distribution in corporate or university networks: a super server sends data packets to subordinate servers in each division. Each of these servers initiates the computation in its own domain. This significantly reduces the communication complexity, or at least confines it locally. The processed packets are still sent to a central collector, making dynamic regrouping possible. The clients of a new group will then receive their packets from the server of the new group.
Marking the processed data packets with the id of the group which processed them is mandatory. This allows the collector to find out which group processed each data packet, so that this group is resupplied with data to process once it drops below a given threshold.

3 An adaptive distribution strategy

Heterogeneous networks consist of processing elements with different performance capabilities (CPU, memory etc). Information about the complexity of the tasks being processed is usually not available. Furthermore, the number of users working on a particular workstation is continuously changing. Thus it is impossible to predict the performance of any particular workstation in a network at a given time. This
makes it impossible to schedule task processing a priori. A dynamic distribution strategy for processing tasks is thus needed. The number and size of the distributed data packages must be adapted to the workload of the processing element at transmission time. Even this strategy may not be near optimal, as additional tasks can be started on the PE between the determination of the current load and the arrival of the data packages. More powerful PEs, or those with a low utilisation, will receive packets more frequently or will be allocated larger packages. In large networks some low performance nodes can be skipped and the work distributed to more powerful PEs. If this is not possible, the data blocks sent to the low performance nodes will automatically be adapted.

For the concrete realisation of this method a performance ranking must be generated. This can be done by calculating the difference between sent and processed packages as described above. In the first distribution run each processing element is supplied with n packages. After a certain time interval a performance rank list is created. The number of packets for the respective processing elements is then increased or decreased. This operation is repeated until the collector has received all data.

Alternatively the packet size can be adapted. Larger packets are sent to the PEs at the top of the performance list. This can minimise the communication and network traffic. However, this is not always possible. For example, an image is usually subdivided into n sections. If all sections are distributed during the first run, a change of the package size is not possible without a loss of already processed data. This performance information can only be used if the image has large dimensions, or if a whole image sequence is to be processed. A disadvantage of this model is that additional logic for the management of dynamic block sizes is necessary in the clients.
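One possible form of the ranking and adjustment step is sketched below. The paper does not specify how far the packet counts are raised or lowered, so the rule used here (compare each client's backlog of sent but unprocessed packets with the mean backlog and change its quota by a fixed step) is an assumption made for illustration, as are the function name `rebalance` and its parameters.

```python
def rebalance(quota, sent, processed, step=1, min_quota=1):
    """Rank clients by backlog and adjust how many packets each gets per round.

    quota, sent and processed map a client name to a packet count.
    Returns (ranking, new_quota); the ranking lists the best performer first.
    The comparison against the mean backlog is an assumed rule, not the paper's.
    """
    if not quota:
        return [], {}
    # Backlog = packets sent but not yet reported by the collector.
    backlog = {c: sent[c] - processed[c] for c in quota}
    ranking = sorted(quota, key=lambda c: backlog[c])
    mean = sum(backlog.values()) / len(backlog)
    new_quota = {}
    for client in ranking:
        if backlog[client] < mean:
            new_quota[client] = quota[client] + step                   # keeps up: more work
        elif backlog[client] > mean:
            new_quota[client] = max(min_quota, quota[client] - step)   # lags: less work
        else:
            new_quota[client] = quota[client]
    return ranking, new_quota


if __name__ == "__main__":
    quota = {"fast": 4, "medium": 4, "slow": 4}      # first run: n = 4 packages each
    sent = {"fast": 4, "medium": 4, "slow": 4}
    processed = {"fast": 4, "medium": 2, "slow": 0}
    print(rebalance(quota, sent, processed))
```

Calling the function once per time interval, with the counters reported by the collector, repeats the adjustment until all data has been received. A variant that adapts the packet size instead of the packet count would need the splitting and merging logic in the clients mentioned above.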
Furthermore, the complexity of the model tasks and the requirements regarding the knowledge of the user are increased.

4 The usage of the system

The data flow of the proposed model for the parallel processing of multimedia data involves the following steps:

1. The generated data packets are put into the pool of tasks when processing starts and the distribution manager is initialised.
2. The data packets received by the clients are stored in the local pool of tasks, which is essentially a queue. The computation then starts.
3. Processed data is stored in the local pool of results and sent to the collection manager.
4. The collection manager informs the distribution manager of the receipt of the processed packets.
5. When all data packets have been received, the so-called NULL-packet is distributed. Every processing element which receives a NULL-packet immediately terminates processing.

An object oriented system design helps to make system components reusable and lessens the difficulty of using the distribution models. The most
important class is the processing class. It does the actual processing and is the focal point of the model. All other classes support it by managing the administration, reception and distribution of data. The parameter of its run()-method contains the data to be processed. The packet is processed in this method, stored in the local pool of results by means of a return call and is then sent back. Using this system merely requires overriding the run()-method of the processing class, which adapts the class to a specific problem.

The distribution and collection managers have to be initialised at the beginning of a session. Furthermore, the required processes need to be launched on the processing nodes. These will then contact the distributor and collector on their own. At this stage the system is idle. The pool of tasks is now filled with the required packets. Once this has been done, the distributor is activated and the data is processed. All processed packets are stored in the pool of results. Manipulating the packet size requires overriding the methods that split and merge the packets.

5 Performance measurements

The measurements were performed on a cluster of PCs (AMD K6, 300 MHz, running Linux) connected over 10 Mbit/s Ethernet. In a first series of measurements, different block sizes and numbers of iterations as well as various configurations of the processing model were examined in order to obtain data about the efficiency and the run time behaviour of the proposed system.
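The speedup and efficiency values reported in Table 1 are consistent with the usual definitions, speedup Sp = T1 / Tp and efficiency Ep = Sp / p, where T1 is the run time on one PE and Tp the run time on p PEs. The paper does not state these formulas explicitly, so they are an assumption here, and the function name below is hypothetical. A small sketch, checked against the 200-iteration row of Table 1:

```python
def speedup_and_efficiency(t_single, t_parallel, n_pe):
    """Return (speedup, efficiency) for a run on n_pe processing elements.

    Assumes the usual definitions: speedup = T1 / Tp, efficiency = speedup / p.
    """
    if t_parallel <= 0 or n_pe <= 0:
        raise ValueError("run time and number of PEs must be positive")
    speedup = t_single / t_parallel
    return speedup, speedup / n_pe


if __name__ == "__main__":
    # Run times in seconds for 200 iterations, taken from Table 1 (1 PE and 4 PEs).
    sp, ep = speedup_and_efficiency(115.953, 35.130, 4)
    print(round(sp, 3), round(ep, 3))  # 3.301 0.825
```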
Table 1: Measurement results (run times, speedup Sp and efficiency Ep) with the implemented prototype

Iterations  1 PE      2 PE                    3 PE                    4 PE
            Time[s]   Time[s] / Sp / Ep       Time[s] / Sp / Ep       Time[s] / Sp / Ep
10          35.693    25.373 / 1.407 / 0.703  22.100 / 1.615 / 0.538  22.455 / 1.590 / 0.397
30          43.153    27.213 / 1.586 / 0.793  24.640 / 1.751 / 0.584  22.252 / 1.939 / 0.485
50          51.373    31.373 / 1.637 / 0.819  27.739 / 1.852 / 0.617  24.050 / 2.136 / 0.534
70          61.534    34.993 / 1.758 / 0.879  30.220 / 2.036 / 0.679  25.540 / 2.409 / 0.602
90          69.813    39.493 / 1.768 / 0.884  31.519 / 2.215 / 0.738  27.919 / 2.501 / 0.625
110         76.353    42.301 / 1.805 / 0.902  34.430 / 2.218 / 0.739  30.370 / 2.514 / 0.629
130         85.553    46.779 / 1.829 / 0.914  35.820 / 2.388 / 0.796  30.591 / 2.797 / 0.699
150         93.713    50.519 / 1.855 / 0.928  39.039 / 2.400 / 0.800  32.081 / 2.921 / 0.730
170         102.233   55.369 / 1.846 / 0.923  42.080 / 2.429 / 0.810  34.179 / 2.991 / 0.748
190         110.733   60.659 / 1.825 / 0.913  44.100 / 2.511 / 0.837  35.100 / 3.155 / 0.789
200         115.953   59.819 / 1.938 / 0.969  45.849 / 2.529 / 0.843  35.130 / 3.301 / 0.825

Table 1 shows the run times needed for 10 to 200 iterations of a simple inverting operation performed on a 10 Mbyte block, as well as the speedup factor Sp
and the efficiency Ep. The data is subdivided into subsections of 16384 bytes and distributed to the individual PE clients according to the strategy described. Speedup values between 1.4 and 3.3 are reached in this simple application. At the beginning the network communication is the dominant factor, resulting in speedups between 1.407 (2 PEs) and 1.590 (4 PEs). With larger numbers of iterations a roughly linear increase of the speedup values can be observed, reaching a top speedup value of 3.3 in the case of 4 PEs and 200 iterations. The efficiency decreases only slightly, e.g. there is a difference of 0.24 between the mean values of the two and four PE systems. Thus the scalability of the system model appears to be good.

A clearer description of the results is given in Figure 2. The right hand diagram shows the run times of the different system configurations; the left hand diagram contains the mean speedup and efficiency values for the parallel configurations.

[Figure 2: A diagram of the speedup and efficiency values achieved (left); run times for 1-4 PEs (right). Left panel: speedup and efficiency (mean values) against the number of processing elements (2, 3, 4). Right panel: time [s] against the number of iterations for 1, 2, 3 and 4 clients.]

The achieved results are compared to the mean speedup and efficiency values of PVM, which are shown in Figure 3. The measurements were performed on the same configuration (K6 with Linux, distribution of blocks of 16384 bytes) with type conversion disabled.

[Figure 3: A diagram of the PVM average speedup and efficiency values. Speedup and efficiency (mean values) against the number of processing elements (2, 3, 4).]
An analysis of the PVM results shows slightly better speedup and efficiency values in the case of two processing elements. These decrease when larger numbers of PEs are used. The effort of management and transfer clearly reduces the performance. Thus the proposed system model reaches a speedup and efficiency five times better than PVM for configurations with four PEs.

6 Conclusions

In this paper a specialised computation model based on asynchronous transmission is presented, which automatically adapts to the workload of the elements in the parallel environment at transmission time, enables easy development of parallel programs and minimises the communication and management effort by direct use of TCP/IP sockets. It is based on the creation of pools of tasks, which are controlled by three special modules. A simple distribution strategy for the available packages increases the computing efficiency. More powerful processing elements, or those with a small workload, will receive packages more frequently. Additionally the package size can be adapted. The efficiency of the proposed method is evaluated through a variety of performance measurements. The results are compared with the results of PVM.

Future work includes extensions which primarily concern improving the system's performance. Storing the packets in the local file system, similar to a spool directory, makes it possible to save all packets of the same type that are to be processed in a special directory. Furthermore, comparative benchmarks with other systems are to be performed.

References:

1. PVM Home page: Documentation, comparison between various packages, www.epm.ornl.gov/pvm
2. CONDOR Project description, documentation, www.cs.wisc.edu/condor/
3. MPI Project Home page: Documentation, tutorials, etc, www.mpi-forum.org
4. Mosix Home page: www.cs.huji.ac.il/mosix/
5. Information about PVMPI: www.cs.utk.edu/~fagg/pvmpi/
6. S. Kleiman, D. Shah, Programming with Threads, Prentice Hall, 1995
7. B. Wilkinson, M. Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Prentice Hall, 1998
8. O. Sachs, Analyse von Aktienreihen mittels paralleler Korrelationsberechnungen, Master thesis, TU Clausthal, 1998