DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA

M. GAUS, G. R. JOUBERT, O. KAO, S. RIEDEL AND S. STAPEL
Technical University of Clausthal, Department of Computer Science
Julius-Albert-Str. 4, 38678 Clausthal-Zellerfeld, Germany

Submitted to World Scientific: 11.10.99

Distributed platforms are not necessarily well-suited for systems which handle large data sets, such as those processed in multimedia applications. In this paper a specialised computation model, based on asynchronous transmission, is presented. As the necessary functions are encapsulated, this system can be used without detailed knowledge of the system architecture. A dynamic task execution strategy adjusts the number and size of the distributed data packages according to the computational load of the processing elements at transmission time. Thus more powerful PEs, or those whose resources are not fully utilised, either receive packages more frequently or are given larger packages. In large networks some nodes can be replaced by others, or only a few data blocks may be sent to particular nodes. The efficiency of the method is evaluated with a variety of practical run time measurements.

1 Introduction

Distributed systems consisting of a network of workstations are increasingly being used for solving compute-intensive problems. Distributed platforms are, however, not always well-suited for systems which handle large data sets as found in, for example, multimedia applications. The limiting factor for the processing of large data sets is usually network bandwidth: the distribution of huge amounts of data bounds the overall processing speed. This situation is made worse by the fact that data is transmitted only when requested or sent by the parallel processes. In order to reduce this effect, the transmission of data should be separated from the process synchronisation.
Well-known software systems for parallel/distributed processing on existing computer networks are PVM, MPI, PVMPI, Condor, Mosix [1-5] and TreadMarks. An advantage of PVM is its availability on nearly all important architectures and operating systems. On the other hand, its synchronous data transfers and type conversions are time consuming, making it unsuitable for the processing of large multimedia data sets.

1.1 Multimedia data

Multimedia data has become an important component of modern software systems. Static media (images, graphics, text) are combined with dynamic media (audio, video, animations) to obtain realistic representations of natural processes, to visualise complex results or to depict dynamic processes. In spite of the increases in memory sizes, processing and communication speeds, the processing and communication of multimedia data is still time- and compute-intensive. Some of the initial problems, essentially data compression, could be solved by the development of efficient compression algorithms, e.g. JPEG, MPEG, MP3. Many of these algorithms have been implemented in hardware, offering the possibility of real-time encoding. The next step resulted in the parallelisation of numerous procedures for processing multimedia data.

Static media, such as encountered in image processing applications, are usually subdivided into independent data fragments, which are then distributed among a number of processing elements. The results are gathered and combined to form the final result. With dynamic media, interdependencies between the different data blocks must be considered and resolved. An example is MPEG compression, which is based on finding and eliminating redundant information in consecutive frames.

Parallelisation by means of data segmentation is well-suited for parallel computers with shared memory, since little or no time is spent on communicating the data. Software for distributed computing in heterogeneous networks gains less performance, because of the slow synchronous transfers and the greater variance in client resources. If the operations executed are simple, these delay effects can be seen quite clearly. An example is the calculation of correlation coefficients for short-term series [8]: considering all combinations between 100 shares and a time difference of 5 days resulted in 681450 correlation terms and 27 megabytes of data. The performance gain from parallelising the algorithm with PVM among 4 DEC Alphas was negated by the resulting administration overhead; the run time on a single workstation was up to 6 times faster than the parallel PVM version. These requirements (large data sets, simple operations) are also found in the management, retrieval and processing of multimedia data.
Current approaches to multimedia databases are based on the extraction and management of specific characteristics. Queries compare the extracted characteristics with all images stored in the database and return the most similar images. Each archival and retrieval process results in the computation of huge amounts of data. Performance gains through parallelisation are negated by transfer times and the administration of the data, as described in the correlation example. This results in the necessity of a specialised model for the parallel processing of huge amounts of data.

2 Processing model for static multimedia data

The proposed processing model aims to make the development of parallel programs easy for non-experienced users, and to minimise the communication and management effort by using TCP/IP sockets directly. Similar to the work pile model [6], this model is based on the creation of pools of tasks, which are controlled by three special processes (distribution manager, collection manager and computation client). The information is divided into sections which are distributed to a number of processing elements (Figure 1).

Figure 1: Schematic representation of the processing model (pool of tasks, distribution manager, PE 1..n, collection manager, pool of results)

2.1 Distribution manager

The distribution manager is responsible for the division and management of the data packets to be processed. Push technology is used to minimise the transfer cost between server and clients. The responsibilities of the distribution manager include the definition of data packets, the management of data packets in the local pool of tasks, the processing of client requests and the distribution of the data packets among the processing elements. The distribution strategy is set within this process. Essential requirements include the efficient use of available resources as well as failure tolerance.

To circumvent problems related to processing element failures, the data packets are subdivided into three groups: the first group consists of packages which have not yet been distributed, the second group comprises transmitted but unprocessed data, and the third group consists of processed data packets. A simple distribution strategy for the available data packets increases computing efficiency: if the first group is empty, but non-processed data blocks remain in the second group, these are dispatched to idle clients which have already completed their computation tasks. This is achieved by maintaining a list of all available active nodes and the status of their local pools of tasks. The number of distributed but not yet processed packets can be calculated from the number of packets sent but not yet received by the collection task; this requires a direct connection between the distributor and the collector. The difference is compared to a given threshold value. If it is below the threshold, the distribution manager sends new packets to the client.
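The bookkeeping behind this strategy can be sketched as follows. This is a minimal illustration, not the authors' implementation; the class and parameter names (`DistributionManager`, `threshold`, the packet tuples) are assumptions:

```python
from collections import deque

class DistributionManager:
    """Tracks packets in the three groups described above and decides
    when a client should be resupplied (sketch, not the original code)."""

    def __init__(self, packets, threshold=2):
        self.undistributed = deque(packets)   # group 1: never sent
        self.in_flight = {}                   # group 2: sent, not yet collected (id -> data)
        self.processed = set()                # group 3: ids confirmed by the collector
        self.threshold = threshold

    def packet_received(self, packet_id):
        """Called when the collector reports a processed packet."""
        self.in_flight.pop(packet_id, None)
        self.processed.add(packet_id)

    def next_packet_for(self, client_backlog):
        """Resupply a client whose count of unprocessed packets is below the threshold."""
        if client_backlog >= self.threshold:
            return None                       # client is still busy enough
        if self.undistributed:
            pid, data = self.undistributed.popleft()
            self.in_flight[pid] = data
            return pid, data
        if self.in_flight:                    # group 1 empty: re-dispatch unprocessed work
            pid = next(iter(self.in_flight))
            return pid, self.in_flight[pid]
        return None                           # everything has been processed
```

In the real system the resupply decision would be driven by messages from the collector over a socket; here `packet_received` stands in for that notification.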
This strategy also requires a time- and/or workload-oriented distribution of the data packets, since processing can only occur if the processing element has a low CPU load. A blocked client that does not satisfy this requirement is regarded as a failed node; the server redistributes the data packets that were sent to it.

2.2 Computation client

This component performs the computation on each processing element. A simple and compact structure reduces the management overhead and enables a significant performance increase. The computation client consists of a local pool of tasks, a processing object and a local pool of results. In the pool of results the processed data packets are temporarily stored until a connection for the transfer to the collection manager becomes available.

2.3 Collection manager

This process accepts processed data packets from the computation clients and stores them in the pool of results until all data packets have been received. Once this occurs, it composes the processed original from the received data packets; during e.g. JPEG encoding, a picture or a series of pictures would be composed at this point. Furthermore, the collector sends a message giving the number of received data packets to the distributor. From this information the distribution manager determines the current workload of each client and redefines the distribution strategy. The distributor is also notified when all data packets have reached the collector and the processing is completed.

2.4 Arrangement in multiple hierarchical levels

The described model consists of two hierarchical levels, containing the distribution and collection processes on one level and the computation clients on the other. This model quickly reaches its capacity limits with a large number of non-local processing elements. An alternative is to arrange servers hierarchically. The lower levels of this hierarchy contain not only clients but subordinate servers as well, which distribute the data packets to lower-level clients. An example of the application of such a model is data distribution in corporate or university networks: a super server sends data packets to subordinate servers in each division, and each of these servers initiates the computation in its own domain. This significantly reduces the communication complexity, or at least binds it locally. The processed packets are still sent to a central collector, making dynamic regrouping possible: the clients of a new group will then receive their packets from the server of the new group.
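The counting and notification role of the collection manager (section 2.3) might be sketched like this; the names and the `notify_distributor` callback are assumptions standing in for the socket connection described in the paper:

```python
class CollectionManager:
    """Accepts processed packets, reports progress to the distributor and
    composes the final result once all packets have arrived (sketch only)."""

    def __init__(self, total_packets, notify_distributor):
        self.total = total_packets
        self.pool_of_results = {}           # packet id -> processed data
        self.notify = notify_distributor    # a socket send in the real system

    def accept(self, packet_id, data):
        """Store one processed packet; return the composed original when complete."""
        self.pool_of_results[packet_id] = data
        self.notify(len(self.pool_of_results))   # progress message to the distributor
        if len(self.pool_of_results) == self.total:
            return self.compose()
        return None

    def compose(self):
        """Reassemble the processed original in packet order."""
        return b"".join(self.pool_of_results[i] for i in range(self.total))
```

The progress messages are what allow the distributor to compute the sent-minus-received difference used by its threshold strategy.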
Marking the processed data packets with the id of the group which processed them is mandatory. This allows the collector to determine which group processed each data packet, so that a group is resupplied with data once its count drops below a given threshold.

3 An adaptive distribution strategy

Heterogeneous networks consist of processing elements with different performance capabilities (CPU, memory, etc.). Information about the complexity of the tasks being processed is usually not available. Furthermore, the number of users working on a particular workstation changes continuously. Thus it is impossible to predict the performance of any particular workstation in a network at a given time, which makes a priori scheduling of task processing impossible. A dynamic distribution strategy for processing tasks is therefore needed: the number and size of the distributed data packages must be adapted to the workload of the processing element at transmission time. Even this strategy may not be near-optimal, as additional tasks can be started on the PE between the determination of the current load and the arrival of the data packages.

More powerful PEs, or those with low utilisation, receive packets more frequently or are allocated larger packages. In large networks some low-performance nodes can be skipped and the work distributed to more powerful PEs. If this is not possible, the number of data blocks sent to the low-performance nodes is automatically adapted.

For a concrete realisation of this method a performance ranking must be generated. This can be done by calculating the difference between sent and processed packages, as described above. In the first distribution run each processing element is supplied with n packages. After a certain time interval a performance rank list is created, and the number of packets for the respective processing elements is increased or decreased. This operation is repeated until the collector has received all data.

Alternatively, the packet size can be adapted: larger packets are sent to the PEs at the top of the performance list. This can minimise the communication and network traffic. However, it is not always possible. For example, an image is usually subdivided into n sections; if all sections are distributed during the first run, a change of the package size is not possible without losing already processed data. The performance information can then only be used if the image has large dimensions, or if a whole image sequence is to be processed. A disadvantage of this approach is that additional logic for the management of dynamic block sizes is necessary in the clients.
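The ranking step described above might be sketched as follows; the counter names and the quota-adjustment rule are assumptions, since the paper does not specify them:

```python
def rank_clients(sent, processed):
    """Rank clients by backlog (sent - processed): the smaller the backlog,
    the faster the client is draining its local pool (sketch only)."""
    backlog = {c: sent[c] - processed[c] for c in sent}
    return sorted(backlog, key=backlog.get)   # fastest client first

def adjust_quota(quota, ranking, step=1):
    """Give the top-ranked client more packets per round and the
    bottom-ranked client fewer, but never fewer than one."""
    new = dict(quota)
    new[ranking[0]] = quota[ranking[0]] + step
    new[ranking[-1]] = max(1, quota[ranking[-1]] - step)
    return new
```

Repeating this rank-and-adjust cycle per time interval converges the per-PE quotas towards the relative speed of the machines, as the paper describes.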
Furthermore, the complexity of the model tasks and the requirements regarding the user's knowledge are increased.

4 The usage of the system

The data flow of the proposed model for the parallel processing of multimedia data involves the following steps. The generated data packets are put into the pool of tasks when processing starts and the distribution manager is initialised. The data packets received by the clients are stored in the local pool of tasks, which is essentially a queue, after which the computation starts. Processed data is stored in the local pool of results and sent to the collection manager. The collection manager informs the distribution manager of the receipt of the processed packets. When all data packets have been received, a so-called NULL-packet is distributed; every processing element which receives a NULL-packet immediately terminates processing.

An object-oriented system design helps to make the system components reusable and lessens the difficulty of using the distribution models. The most important class is the processing class. It does the actual processing and is the focal point of the model; all other classes support it by managing the administration, reception and distribution of data. The parameter of its run()-method contains the data to be processed. The packet is processed in this method, stored in the local pool of results by means of a return call, and then sent back. Using the system merely requires overloading the run()-method of the processing class, adjusting the class for specific problems.

The distribution and collection manager have to be initialised at the beginning of a session. Furthermore, the required processes need to be launched on the processing nodes; these then contact the distributor and collector on their own. At this stage the system is idle. The pool of tasks is now filled with the required packets. Once this has been done, the distributor is activated and the data is processed. All processed packets are stored in the pool of results. Manipulating the packet size requires overloading the methods that split and merge the packets.

5 Performance measurements

The measurements were performed on a cluster of Linux K6 300 MHz PCs connected over 10 Mbit Ethernet. In a first attempt, different block sizes and numbers of iterations, as well as various configurations of the processing model, were examined in order to obtain data about the efficiency and the run time behaviour of the proposed system.
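The run()-overloading pattern and the NULL-packet shutdown can be sketched like this. The paper does not give its class definitions, so all names here (`Processing`, `NULL_PACKET`, the queue-based pools) are assumptions; the inverting operation mirrors the benchmark in section 5:

```python
import queue

NULL_PACKET = None   # assumed sentinel standing in for the NULL-packet

class Processing:
    """Base processing class: users overload run() for their problem."""
    def run(self, data):
        raise NotImplementedError

class InvertBytes(Processing):
    """Example overload: the simple inverting operation used in section 5."""
    def run(self, data):
        return bytes(255 - b for b in data)

def computation_client(proc, pool_of_tasks, pool_of_results):
    """Consume packets from the local pool of tasks until a NULL-packet
    arrives, placing processed packets in the local pool of results."""
    while True:
        packet = pool_of_tasks.get()
        if packet is NULL_PACKET:
            break                      # terminate processing immediately
        pid, data = packet
        pool_of_results.put((pid, proc.run(data)))
```

In the real system the two queues would be fed and drained over TCP/IP sockets by the distribution and collection managers; only the run() override is problem-specific.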
Table 1: Measurement results (run times, speedup Sp and efficiency Ep) with the implemented prototype

Iterations | 1 PE: Time[s] | 2 PEs: Time[s]/Sp/Ep | 3 PEs: Time[s]/Sp/Ep | 4 PEs: Time[s]/Sp/Ep
        10 |        35.693 |  25.373/1.407/0.703  |  22.100/1.615/0.538  |  22.455/1.590/0.397
        30 |        43.153 |  27.213/1.586/0.793  |  24.640/1.751/0.584  |  22.252/1.939/0.485
        50 |        51.373 |  31.373/1.637/0.819  |  27.739/1.852/0.617  |  24.050/2.136/0.534
        70 |        61.534 |  34.993/1.758/0.879  |  30.220/2.036/0.679  |  25.540/2.409/0.602
        90 |        69.813 |  39.493/1.768/0.884  |  31.519/2.215/0.738  |  27.919/2.501/0.625
       110 |        76.353 |  42.301/1.805/0.902  |  34.430/2.218/0.739  |  30.370/2.514/0.629
       130 |        85.553 |  46.779/1.829/0.914  |  35.820/2.388/0.796  |  30.591/2.797/0.699
       150 |        93.713 |  50.519/1.855/0.928  |  39.039/2.400/0.800  |  32.081/2.921/0.730
       170 |       102.233 |  55.369/1.846/0.923  |  42.080/2.429/0.810  |  34.179/2.991/0.748
       190 |       110.733 |  60.659/1.825/0.913  |  44.100/2.511/0.837  |  35.100/3.155/0.789
       200 |       115.953 |  59.819/1.938/0.969  |  45.849/2.529/0.843  |  35.130/3.301/0.825

Table 1 shows the run times needed for 10-200 iterations of a simple inverting operation performed on a 10 Mbyte block, as well as the speedup factor Sp

and the efficiency Ep. The data is subdivided into subsections of 16384 bytes and distributed to the single-PE clients according to the strategy described. Speedup values between 1.4 and 3.3 are reached in this simple application. At the beginning, network communication is the dominating factor, resulting in speedups between 1.407 (2 PEs) and 1.59 (4 PEs). With larger numbers of iterations a linear increase of the speedup values can be observed, reaching a top speedup of 3.3 for 4 PEs and 200 iterations. The efficiency decreases only slightly, e.g. there is a difference of 0.24 between the mean values of the two- and four-PE systems. Thus the scalability of the system model appears to be good. A clearer picture of the results is given in Figure 2: the left-hand diagram contains the mean speedup and efficiency values for the parallel configurations, and the right-hand diagram shows the run times of the different system configurations.

Figure 2: Mean speedup and efficiency values achieved (left); run times for 1-4 PEs (right)

The achieved results are compared to the mean speedup and efficiency values of PVM, shown in Figure 3. The measurements were performed on the same configurations (K6 with Linux, distribution of 16384 byte blocks) with type conversion disabled.

Figure 3: Mean PVM speedup and efficiency values
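The speedup and efficiency columns of Table 1 follow the usual definitions Sp = T1/Tp and Ep = Sp/p for p processing elements, which can be checked against the first and last rows of the table:

```python
def speedup_efficiency(t1, tp, p):
    """Sp = T1 / Tp, Ep = Sp / p for p processing elements."""
    sp = t1 / tp
    return sp, sp / p

# First row of Table 1 (10 iterations): 35.693 s on 1 PE, 25.373 s on 2 PEs
sp, ep = speedup_efficiency(35.693, 25.373, 2)

# Last row (200 iterations): 115.953 s on 1 PE, 35.130 s on 4 PEs
sp4, ep4 = speedup_efficiency(115.953, 35.130, 4)
```

Rounded to three decimals these reproduce the tabulated values 1.407/0.703 and 3.301/0.825.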

An analysis of the PVM results shows slightly better speedup and efficiency values in the case of two processing elements. These values decrease when larger numbers of PEs are used; the management and transfer effort clearly reduces the performance. Thus, in configurations with four PEs, the proposed system model reaches a five times better speedup and efficiency.

6 Conclusions

In this paper a specialised computation model based on asynchronous transmission was presented, which automatically adapts to the workload of the elements in the parallel environment at transmission time, enables easy development of parallel programs and minimises the communication and management effort by using TCP/IP sockets directly. It is based on the creation of pools of tasks, which are controlled by three special modules. A simple distribution strategy for the available packages increases the computing efficiency: more powerful processing elements, or those with a small workload, receive packages more frequently. Additionally, the package size can be adapted. The efficiency of the proposed method was evaluated through a variety of performance measurements, and the results were compared with those of PVM.

Future work includes extensions which primarily concern improving the system's performance. Storing the packets in the local file system, similar to a spool directory, makes it possible to save all packets of the same type that are to be processed in a special directory. Furthermore, comparative benchmarks with other systems are to be performed.

References

1. PVM home page: documentation, comparison between various packages, www.epm.ornl.gov/pvm
2. Condor project description and documentation, www.cs.wisc.edu/condor/
3. MPI Forum home page: documentation, tutorials, etc., www.mpi-forum.org
4. Mosix home page: www.cs.huji.ac.il/mosix/
5. Information about PVMPI: www.cs.utk.edu/~fagg/pvmpi/
6. S. Kleiman, D. Shah: Programming with Threads, Prentice Hall, 1995
7. B. Wilkinson, M. Allen: Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Prentice Hall, 1998
8. O. Sachs: Analyse von Aktienreihen mittels paralleler Korrelationsberechnungen (analysis of share series by means of parallel correlation computations), Master's thesis, TU Clausthal, 1998