Distributed Spatial k-dominant Skyline Maintenance Using Computational Object Preservation

Size: px

Start display at page:

Download "Distributed Spatial k-dominant Skyline Maintenance Using Computational Object Preservation"

Arleen Jordan
6 years ago
Views:

1 International Journal of Networking and Computing ISSN (print) ISSN (online) Volume 3, Number 2, pages , July 213 Distributed Spatial k-dominant Skyline Maintenance Using Computational Object Preservation Md. Anisuzzaman Siddique, Asif Zaman, Md. Mahbubul Islam Computer Science and Engineering Department University of Rajshahi Rajshahi-625, Bangladesh anis and Yasuhiko Morimoto Graduate School of Engineering Hiroshima University Kagamiyama 1-7-1, Higashi-Hiroshima , Japan Received: February 2, 213 Revised: May 14, 213 Accepted: June 18, 213 Communicated by Sayaka Kamei Abstract The existing k-dominant skyline solutions are restricted to centralized query processors, limiting scalability, and imposing a single point of failure. To overcome those problems in this paper, we propose the computation and maintenance algorithms for spatial k-dominant skyline query processing in large-scale distributed environment. Where the underlying dataset is partitioned into geographically distant computing core (personal computer) that are connected to the coordinator (server). Our proposed techniques preserve the spatial k-dominant computation object itself into a serialized form. This preservation is done in client s core after completing a computational job successfully. When the issue of maintenance comes in action, preserve data object retrieves and use for computation. This procedure eliminates the necessity of intermediate re-send and re-computation of k-dominant skyline for the maintenance issue. Thus, we quantify the gain of data transferring consecutively into different cores to maximize the overall gain as well as the query or balancing the load on different cores fairly. Extensive performance study shows that proposed algorithms are efficient and robust to different data distributions. Keywords: Skyline, k-dominant Skyline, Load table, Maintenance 1 Introduction An user often wants to optimize their decision making and selection criteria across multiple attributes. In many cases, especially in multi-criteria decision making, there is no single best answer to a query. Skyline queries do not require an explicit preference function, which may be difficult for the user to define when the relative importance of the different criteria is vague. 244

2 International Journal of Networking and Computing ID Price Distance h h Distance h 1 Skyline h 2 h h h 3 h 5 h h 4 (a) Hotels (b) Skyline Price Figure 1: Skyline Example A skyline query retrieves a set of skyline objects so that the user can choose promising objects from them and make further inquiries. Skyline objects in a database are objects that are not dominated by any other objects in the database. Given an n-dimensional database DB, an object O i is said to be in skyline of DB if there is no other object O j (i j) in DB such that O j is better than O i in all dimensions. If there exists such O j, then we say that O i is dominated by O j or O j dominates O i. Figure 1 shows a typical example of skyline. The table in the figure is a list of hotels, each of which contains two numerical attributes: distance and price, for online booking. A user chooses a hotel from the list according to her/his preference. In this situation, her/his choice usually comes from the hotels in skyline, i.e., one of h 1, h 3, h 4 (see Figure 1 (b)). Therefore, such skyline query functions are important for several database applications, including customer information systems, decision support, data visualization, and so forth. A number of efficient algorithms for computing skyline objects have been reported in the literature [1, 2, 3, 4, 5]. In a high-dimensional dataset a skyline query often retrieves too many objects to investigate. To minimize the result and to find more important and meaningful objects, Chan et al. considered k-dominant skyline query [6]. They relaxed the definition of dominated so that an object is likely to be dominated by another. Given an n-dimensional database, an object O i is said to k-dominates another object O j (i j) if there are k (k n) dimensions in which O i is better than or equal to O j. A k-dominant skyline object is an object that is not k-dominated by any other objects. Until now the k-dominant skyline literature has focused on centralized databases. In practice, data are often collected from multiple sources that are geographically distant from each other. For example, assume a travel agency has numerous computing cores (data sources), each of which is located in a different city (New York, Los Angeles, Chicago, etc.), and manages only local properties. A query may demand the k-dominant skyline of the properties in specific source via coordinator (server). For this or similar type application, a distributed architecture is more suitable. This is because, compared to the centralized counterpart, it has smaller computation cost, it allows better local data management, smaller update cost, and higher tolerance to machine failures. In this paper, we study the efficient maintenance of spatial k-dominant skyline queries, where the underlying dataset is partitioned into geographically distant computing cores that are connected to the coordinator. Assume a system architecture, that consists of a server with a network access point, and several core connect to the server (Figure 2). The server can directly communicate with any core and request for k-dominant skyline computation. In such a setting, each time the server sends a k-dominant skyline request with new dataset or maintenance request to a core, all objects information that are not k-dominated have to be transferred to the server. Our previous work [16] suffers from data transmission point of view. The server needs to resend the dataset again and again for each update. For example, assume initially cores C 1, C 2, and C 3 are assigned to process three regional datasets R 1, R 2, and R 3 respectively. After completing the computational work when a 245

3 Distributed Spatial k-dominant Skyline Maintenance Using Computational Object Preservation C2 C1 C3 C7 C4 C6 C5 Figure 2: System architecture core C 1, becomes free then the new dataset D 4 will be assigned to it. In our previous work [16], for computational complexity, C 1 did not store internal data structure related to D 1. Next, when dataset D 1 requires some maintenance work then the server needs to assign a new core to perform the same computation and maintenance operation respectively. This procedure consumes a lot of computing power. The proposed work overcomes that problem and the controlling server need not to resend the dataset again and again for each maintenance issue. In this approach data computation object itself is stored as binary logic object data in the corresponding computing core. This technique usually used to store image objects. As computation object itself is stored, we don t have to worry about the complexity of storing internal data structure related to a dataset. Computational objects are stored and can be restored again if necessary. Wu et al. [29] proposed distributed skyline query processing algorithm (DSL), to support skyline search on a horizontally partitioned dataset. DSL is not applicable to our problem, where the underlying database DB is horizontally partitioned onto the participating cores in an arbitrary manner. Specifically, we allow each core to contain any subset of DB, rather than only the data falling in a particular region. It is worth mentioning that the partitioning scheme adopted by DSL incurs expensive network overhead for maintenance DB. For example, assume that core C 1 receives an insertion to DB that, however, falls in the region assigned to core C 2. Since the object must be retained at core C 2, an object transfer (from core C 1 to core C 2 ) is required. In general, every insertion may be accomplished by an object transfer, which is prohibitively costly in practice. Similar thing will happen for delete as well as update operations. The proposed work does not suffer from these drawbacks, because each core can manage its local dataset without any network transmission. Here are the contribution of this paper: (1) We extend k-dominant computation from centralized to distributed datasets. We propose a framework for the maintenance of distributed spatial k-dominant skyline queries over multiple cores, aiming to reduce data transmission overhead. (2) We introduce three novel algorithms for distributed k-dominant skyline queries. The registration and ranking power calculation, and the job distributed algorithms work in server site. For client site we develop computation core selection algorithm. (4) Moreover, all computing cores are reliable, failure of one or more computing core does not create any problem, the rest of the computing cores are responsible to conduct the computational work. (5) We conduct extensive experiments to perform the evaluation. The evaluation results show that the proposed techniques are efficient and scalable. Our algorithms consistently outperform against its competitor multicore based spatial k-dominant skyline () in all examined setups. The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 presents the notions and properties of regional k-dominant skyline computation. We provide detailed 246

4 International Journal of Networking and Computing examples and analysis of our algorithm in Section 4. We experimentally evaluate our algorithm in Section 5 under a variety of settings. Finally, Section 6 concludes the paper. 2 Related Work Our work is motivated by previous studies of skyline query processing, k-dominant skyline query processing as well as distributed k-dominant skyline query processing, which are reviewed in this section. 2.1 Skyline Query Processing Borzsonyi et al. first introduced the skyline operator over large databases and proposed three algorithms: Block-N ested-loops(bn L), Divide-and-Conquer(D&C), and B-tree-based schemes [2]. BNL compares each object of the database with every other object, and reports it as a result only if any other object does not dominate it. A window W is allocated in main memory, and the input relation is sequentially scanned. In this way, a block of skyline objects is produced in every iteration. In case the window saturates, a temporary file is used to store objects that cannot be placed in W. This file is used as the input to the next pass. D&C divides the dataset into several partitions such that each partition can fit into memory. Skyline objects for each individual partition are then computed by a main-memory skyline algorithm. The final skyline is obtained by merging the skyline objects for each partition. Chomicki et al. improved BNL by presorting, they proposed Sort-F ilter-skyline(sf S) as a variant of BNL [4]. Among index-based methods, Tan et al. proposed two progressive skyline computing methods Bitmap and Index [7]. In the Bitmap approach, every dimension value of a point is represented by a few bits. By applying bit-wise AND operation on these vectors, a given point can be checked if it is in the skyline without referring to other points. The index method organizes a set of d-dimensional objects into d lists such that an object O is assigned to list i if and only if its value at attribute i is the best among all attributes of O. Each list is indexed by a B-tree, and the skyline is computed by scanning the B-tree until an object that dominates the remaining entries in the B-trees is found. The current most efficient method is Branch-and-Bound Skyline(BBS), proposed by Papadias et al., which is a progressive algorithm based on the best-first nearest neighbor (BF-NN) algorithm [5]. Instead of searching for nearest neighbor repeatedly, it directly prunes using the R*-tree structure. Recently, more aspects of skyline computation have been explored. Vlachou et al. introduce the concept of extended skyline set, which contains all data elements that are necessary to answer a skyline query in any arbitrary subspace [9]. Fotiadou et al. mention about the efficient computation of extended skylines using bitmaps in [2]. Chan et al. introduce the concept of skyline frequency to facilitate skyline retrieval in high-dimensional spaces [11]. Tao et al. discuss skyline queries in arbitrary subspaces [12]. 2.2 k-dominant Skyline Query Processing Chan et al. introduced k-dominant skyline query [6]. They proposed three algorithms, namely, One-Scan Algorithm (OSA), Two-Scan Algorithm (TSA), and Sorted Retrieval Algorithm (SRA). OSA uses the property that a k-dominant skyline objects cannot be worse than any skyline object on more than k dimensions. This algorithm maintains the skyline objects in a buffer during the scan of the dataset and uses them to prune away objects that are k-dominated. TSA retrieves a candidate set of dominant skyline objects in the first scan by comparing every object with a set of candidates. The second scan verifies whether these objects are truly dominant skyline objects or not. This method turns out to be much more efficient than the one-scan method. A theoretical analysis is provided to show the reason for its superiority. The third algorithm, SRA is motivated by the rank aggregation algorithm proposed by Fagin et al., which pre-sorts data objects separately according to each dimension and then merges these ranked lists [8]. The above skyline as well as k-dominant methods are designed for centralized database and do not work in the distributed environment. Several attempts have been made to address distributed 247

5 Distributed Spatial k-dominant Skyline Maintenance Using Computational Object Preservation skyline retrieval, leading to a number of interesting methods [17, 14]. Balke et al. extended the skyline problem for the web in which the attributes of an object are distributed in different webaccessible servers. They assumes that a d-dimensional database is vertically partitioned into d lists, where each list contains the values of a dimension in ascending order. They also develop an enhanced solution that, instead of visiting the lists in a round-robin fashion, always accesses the most promising list, so that the least expansion is performed on each list to discover an anchor point p. Huang et al. [14] study skyline search on hand-held devices in a mobile ad-hoc network (MANET). Each device store a tiny horizontal fraction of the underlying relation, and is able to communicate only with its neighbors, i.e., devices within its contact range. The connection is ad-hoc because (i) two devices cease to be neighbors when they move out of each other s contact range, and (ii) new neighbors are formed when two devices come close to each other. The objective is to compute the skyline over the data of all the devices and report it at the querying device Q, by consuming as little bandwidth as possible. 2.3 Distributed k-dominant Skyline Distributed skyline computation has recently attracted considerable attention and has been studied in a variety of distributed systems, including web information systems [17], parallel systems [23], peer-to-peer systems [18, 2, 21, 24, 25, 27, 28, 29], mobile ad-hoc networks [26], as well as more generic distributed systems [19, 22]. Most of the existing approaches focus on highly distributed environments, such as peer-to-peer networks, assuming that all data sources store common attributes. Several approaches belong to this category, namely DSL [29], SSP [27], SKYPEER [24], and BITPEER [2]. DSL was proposed by Wu et al. [29] and it is the first paper that addresses constrained skyline query processing over disjoint data partitions by using a structured peer-to-peer overlay, namely CAN. Wang et al. [27] propose the SSP algorithm based on the use of a tree based peer-to-peer overlay (BATON) for assigning data to servers. SKYPEER [24] transforms the multidimensional data into one-dimensional values and utilizes a thresholding scheme in order to reduce the transferred data. Fotiadou et al. [2] propose BITPEER for supporting efficiently continuous subspace skylines in a distributed setting by using distributed bitmap indexes. However, unfortunately none of the above methods are designed for distributed k-dominant skyline computation. This is because the maintenance of k-dominant skyline for an update is much more difficult due to the intransitivity of the k-dominance relation. Assume that A k-dominates B and B k-dominates C. However, A does not always k-dominate C. Moreover, C may k-dominate A. Because of the intransitivity property, we have to compare each object against every other object to check the k-dominance. Therefore in [15], Siddique and Morimoto, resolved k-dominant skyline maintenance problem in a centralized dataset. We proposed M SKS [16] algorithm for spatial k-dominant skyline computation and maintenance. The server needs to resend the dataset again and again for each update. In this paper, we expand the k-dominant skyline queries to spatially distributed datasets and resolve the dataset resending problem. 3 Preliminaries This section discusses the distributed k-dominant skyline computation, maintenance problems, and associated properties. Our objective is to extract the local k-dominant skyline and deliver it to a server computer CR. We refer CR as the coordinator of the k-dominant skyline retrieval. Assume there are x regional datasets named R 1, R 2,, R x and y computing cores called C 1, C 2,, C y. CR can be any computer other than C 1, C 2,, C y. CR is able to initiate communication with each core directly. Assume each dataset is n-dimensional and D 1, D 2,, D n are the n attributes of each regional dataset. Let O 1, O 2,, O r be r objects (tuples) of a regional dataset, say R p (1 p x). We use O i.d j to denote the j-th dimension value of O i. 248

6 International Journal of Networking and Computing 3.1 k-dominance An object O i is said to dominate another object O j, which we denote as O i O j, if O i.d s O j.d s for all dimensions D s (s = 1,, n) and O i.d t < O j.d t for at least one dimension D t (1 t n). We call such O i as dominant object and such O j as dominated object between O i and O j. By contrast, an object O i is said to k-dominate another object O j, denoted as O i k O j, if O i.d s O j.d s in k dimensions among n dimensions and O i.d t < O j.d t in one dimension among the k dimensions. We call such O i as k-dominant object and such O j as k-dominated object between O i and O j. An object O i is said to have δ-domination power if there are δ dimensions in which O i is better than or equal to all other objects of R p. 3.2 Spatial k-dominant Skyline An object O i R p (1 p x) is said to be a spatial skyline object of R p if O i is not dominated by any other object in R p. Similarly, an object O i R p is said to be a spatial k-dominant skyline object of R p if O i is not k-dominated by any other object in R p. We denote a set of all spatial k-dominant skyline objects in R p as Sky k (R p ). Note that objects that have k-domination power must be k-dominant skyline objects but not vice versa. 4 Distributed Spatial k-dominant Skyline In this section, we present our algorithms for job distributing and computing spatial k-dominant skyline objects and maintaining the result when update occurs. There is a crucial difference between our network architecture and those in [14, 9, 29]. The architecture in [14, 9, 29] aim at supporting massive distributed environments with a huge number of data machines. In our scenario the cores number is moderately small. Our job distribution algorithms are useful for efficient k-dominant skyline retrieval in distributed environment. Job distribution is important for three reasons. (i) Efficient distributing method enhances user experience by preventing a long idle period. (ii) It allocates the best computing core to the largest regional dataset. (iii) It provides core allocation guarantee for every regional dataset. We illustrate job distributing technique among multiple cores in Section 4.1. Then show how to compute spatial k-dominant skyline for all k at a time in Section 4.2. Section 4.3 presents three types of maintenance solution. They are insertion, deletion, and update operation. The distributed system is designed in such a manner for enhanced computation speed. If a user tries to compute in single core, it will consume a lot of computing time as well as computing power. The user may complain about the overhead of transmitting full dataset each time a new computation is initiated. The proposed techniques do not suffer from transmission issue. 4.1 Job Distribution If the number of regional dataset is equal to or smaller than the available computing core (i.e., x y) then there is no issue to consider for job distribution. However, when x > y, (i.e., the number of regional dataset is larger than the available computing core) job distribution issues occur. We propose three algorithms for job distribution. They are: registration and ranking power calculation, job allocation, and computation core selection. Among the algorithms registration and ranking power calculation and job allocation algorithms work in server site and the remaining computation core selection algorithm works in client site. Registration and Ranking Power Calculation Suppose there are y computing cores called C 1, C 2,, C y for distributed spatial k-dominant skyline computation. Every node needs to register itself to the coordinator CR. The CR has an independent thread for registering and ranking computing cores. A core (say C m ) sends a registration request 249

7 Distributed Spatial k-dominant Skyline Maintenance Using Computational Object Preservation Algorithm 1: Registration and Ranking Power Calculation Create LoadTable(ID, IP Address, Port no., Computing power, Memory size, Network traffic, Rank power, Status) 1. Start and wait 1ms for a registration request. 2. If new request arrives then i. Assign a new ID I m for requesting core C m. ii. Store C m s C m (IP ), C m (CP ), C m (M) in standard database iii. Set I m status as FREE. Else i. Select all cores whose status are not BUSY (i.e., either FREE or DEAD ). ii. For each core C m do the following steps: a) Ping each C m (IP ) 4-6 times and keep average response time C m (RT ). b) If C m (RT ) = infinity, i.e., destination host is unreachable, then Set C m status as DEAD and continue. c) Calculate C m ranking power, RP m = 6 * C m (CP ) + C m (M) +(C m (RT ) 2 ). d) Store RP m in LoadTable. 3. Go to step 1. to CR. Then the independent thread of CR creates a load table named LoadTable(ID, IP Address, Port no., Computing power, Memory size, Network traffic, Rank power, Status). ID can be any value between 1 to y. Each core status can be either FREE, BUSY or DEAD. Computing power C m (CP ), memory size C m (M) are supplied by corresponding core during the registration procedure. method periodically monitors the traffic status during RP m calculation. The average ping time or response time C m (RT ), from controlling core CR to C m uses as an indication of traffic status. If the network is busy, the average ping time will be high. Again if C m is disconnected then the average ping time becomes infinity. In such situation C m is marked as a dead core, i.e., core status field of LoadTable for C m is set as DEAD. We consider three parameters such as computing power, memory size, and network traffic to calculate RP m of C m. C m (CP ) and C m (M) are provided by C m during the registration period and are stored in CR s LoadTable. Rank power RP m indicates the computing power of each individual core and is calculated using formula: RP m = 6 * C m (CP ) + C m (M) +(C m (RT ) 2 ). In each ideal time (after waiting for new request up to a certain period of time), CR continuously computes the ranking power of cores whose status are not BUSY. In all other situations the proposed technique calculates RP m and stores it in LoadTable. Registration and ranking power calculation algorithm is shown in Algorithm 1. Job Allocation The next challenging task is distributing computation job among cores. To do this CR creates a QueueTable(Data id, Data location, Job type, Job status, Processing core ID, Processing time). Data id can be any value such as A, B, C, etc. Data location is the location of the dataset. Job type can be either New or Maintenance. Job status can be set as either Unprocessed or Processed. Processing core ID is the id of the participating core. Processing time is the completion time of the assigned job. A job request may have two different operations: completely new processing (i.e., Spatial k-dominant skyline computation) or a maintenance operation. If the job request is completely new then the type field in QueueTable populate with New otherwise set Maintenance. When CR receives a new job. It places the job in queue and sets job data status as Unprocessed. In CR an independent thread runs continuously to search the queue in order to find whether or not any new job is being submitted. If a job or a set of jobs has status Unprocessed then the thread starts parsing other field values for each new job. Assume a new request for dataset R p has arrived in queue. CR needs to find a suitable core to assign the computation responsibility. If CR receives New request for the dataset R p then, it searches LoadTable to find a free core say, C m with status FREE to do the job. It is possible to have more than one core with FREE status. In such situation it selects the core C m with highest Rank power. After that it initiates a new child thread at CR to consult the computation process with C m. In the child thread, C m status 25

8 International Journal of Networking and Computing Algorithm 2: Job Allocation Create LoadTable (ID, IP Address, Port no., Computing power, Memory size, Network traffic, Rank power, Status) QueueTable(Data id, Data location, Job type, Job status, Processing core ID, Processing time) 1. Search QueueTable for all entries with job status Unprocessed. 2. If found then for all available dataset do the following steps a. Read the job type. b. If the job type is New then i. Search LoadTable for a core C m with status FREE. ii. If C m found then set Status = BUSY Initiate new thread for assign computation: # Make RMI call to do the computation. # Receive the result. # Store result into a standard database. # Update queue, set core ID = C m and Job status = Processed. # Mark C m Status = FREE. iii. Else continue c. Elseif the job type is maintenance then i. Search QueueTable for the core C q who processed the dataset last time. ii. If C q found then Search LoadTable to get the current C q status. If C q Status = FREE then # Update LoadTable, set C q Status = BUSY. # Initiate the maintenance job: (.) Make RMI call to do the computation. (.) Receive result. (.) Store result into standard database. (.) Update queue, set core ID = C m and Job status = Processed. (.) Mark C q Status = FREE. Else continue Else continue 3. Go to step 1. is changed from FREE to BUSY. CR transfers R p to C m, and waits for result. Waiting period may vary depending on the size of R m. After receiving the computation result, the resultant data is being preserved in database and C m status is marked as FREE. The computation time and computing core ID are also stored in CR s data storage system. In case of Maintenance request CR searches QueueTable for the core say, C q who has processed R p most recently. After finding the core C q, CR searches LoadTable to obtain the information on current status of C q. If C q is available i.e., FREE then CR initiates necessary steps to send the maintenance request to C q and waits for result. After receiving the result, it updates the local database. However if C q is not free, CR has to wait until it becomes FREE. During the time of initiating a maintenance job to core C q, LoadTables status field is changed from FREE to BUSY. Meanwhile the dataset R p status field in QueueTable is set to Processed. Algorithm 2 represents the job allocation algorithm. Populating Load and Queue Table This section discusses about how to populate LoadTable and QueueTable. Assume two computing core (core 1 and core 2) send their registration request to CR. These requests are registered to LoadTable as shown in Table 1. Suppose at a certain period of time CR receives three data processing requests. Among the requests two for new data processing and remaining one for maintenance. At this stage the QueueTable becomes Table 2. From QueueTable CR finds some jobs with Unprocessed status. LoadTable shows that both cores have FREE status and core 1 has 251

9 Distributed Spatial k-dominant Skyline Maintenance Using Computational Object Preservation Table 1: LoadTable after complete the registration task ID IP Add. Port no. Comp. Power Mem. Size Net. traffic Rank Power Status GHz 248 MB 6 ms FREE GHz 248 MB 3 ms 26. FREE Table 2: QueueTable after receives three jobs Data id Data Location Job type Job status Processing core ID Processing time A C:\Data\a.csv New Unprocessed NA NA B C:\Data\b.csv New Unprocessed NA NA A C:\Data\x.csv Maintenance Unprocessed NA NA the highest ranking power. Then CR assigns the job with data id A to core 1 and data id B to core 2, and updates LoadTable and QueueTable accordingly (see Tables 3 and 4). Since now no core has FREE status, no new job initialization is possible until any one becomes FREE. Suppose after 15 seconds, computation process of data A has been completed. CR preserves the result. Updated LoadTable and QueueTable are shown in Tables 5 and 6, respectively. Finally, the maintenance job upon dataset A is assigned to core 1. Similar procedure will repeat for large number of new dataset computation as well as maintenance. Computation Core Selection The most challenging task is to develop a mechanism for clients end computation. In this approach, client s core, say, C c, begins its operation by issuing a participation registration request to CR. After that C c waits until any computation request is received. As we mentioned previously, there exist two types computation requests: New and Maintenance. If C c receives a New request, it simply creates an object of computation class and initiates the spatial k-dominant skyline computation. After performing the calculation, C c preserves resultant data and some internal status data of computing object for future maintenance operation. Preserving only resultant and status data may cause some difficulties to re-initiate k-dominate computation as well as maintenance. To overcome those problems we preserve the computing object itself instead of resultant and status data. C c serializes the computing object, and preserves the object as Binary Logic Object Data (BLOD) into a database of file system. Then it returns the result to CR and waits for a new job request. In case of receiving a Maintenance computation request for regional R p dataset, C c searches its local database or file system for the computation object whose dataset match with R p. It should be found, because CR sent such request upon finding that C c has already completed such computation in recent time. C c de-serializes the BLOD data and constructs computation object. After performing required maintenance operation, C c sends back the resultant data and again preserves the serialized form of computation object. Computation core selection algorithm is shown Algorithm Spatial k-dominant Skyline Computation Assume there is a regional dataset R p (1 p x) as shown in Table 7. In order to prune unnecessary objects efficiently in the k-dominant skyline computation, we compute domination power of each Table 3: LoadTable after assign the jobs between two cores ID IP Add. Port no. Comp. Power Mem. Size Net. traffic Rank Power Status GHz 248 MB 6 ms BUSY GHz 248 MB 3 ms 26. BUSY 252

10 International Journal of Networking and Computing Table 4: QueueTable after assign the jobs Data id Data Location Job type Job status Processing core ID Processing time A C:\Data\a.csv New Processing 1 NA B C:\Data\b.csv New Processing 2 NA A C:\Data\x.csv Maintenance Unprocessed NA NA Table 5: LoadTable after completing the job of core 1 ID IP Add. Port no. Comp. Power Mem. Size Net. traffic Rank Power Status GHz 248 MB 6 ms FREE GHz 248 MB 3 ms 26. BUSY Table 6: QueueTable after completing the job of core 1 Data id Data Location Job type Job status Processing core ID Processing time A C:\Data\a.csv New Processed 1 15 B C:\Data\b.csv New Unprocessed 2 NA A C:\Data\x.csv Maintenance Unprocessed NA NA Algorithm 3: Computation Core Selection Standard database management system (MySQL) 1. Each core with Status = FREE sends computation request to CR. 2. If new processing request arrives from CR then a. Check the job type. b. If the job type is New then i. Initiate new computation class object. ii. Perform spatial k-dominant skyline computation. iii. Serialize the computation object. iv. Preserve the computation object into standard database. v. Return result to the CR. c. Elseif the job type is Maintenance Search local database management system for processing object i. If found then # De-serialize the stored object to make it alive. # Perform the maintenance operation. # Serialize the computation object. # Replace the previous serialized object with the new one. # Return result to CR. ii. Else return ERROR: OBJECT NOT FOUND. 4. Return to step

11 Distributed Spatial k-dominant Skyline Maintenance Using Computational Object Preservation Table 7: Symbolic Regional Dataset, R p Obj. D 1 D 2 D 3 D 4 D 5 D 6 O O O O O O O Table 8: Ordered Domination Table Obj. D 1 D 2 D 3 D 4 D 5 D 6 DP Sum DC IDX O O 5 O O 7 O O 7 O O 7 O O 7 O O 7 O O 7 object. We sort objects in descending order by domination power. If more than one objects have the same domination power then sort those objects in ascending order of the sum value. This order reflects how to k-dominate other objects. Table 8 is the sorted object sequence of Table 7, in which the column DP is the domination power and the column Sum is the sum of all values. In the sequence, object O 7 has the highest domination power 4. Note that object O 7 dominates all objects below it in four attributes, D 1, D 2, D 5, and D 6. After computing the sorted object sequence, we compute dominated counter (DC) and dominant index (IDX). Detailed algorithm can be found in [15]. The dominated counter (DC) indicates the maximum number of dominated dimensions by another object in R p. The dominant index (IDX) is the strongest dominator. That means IDX keeps the record of the corresponding strongest dominator for each object. The columns DC and IDX of Table 8 show the result of the procedure. Sky k (R p ) is a set of objects whose DC is less than k. According to the dominated counter, we can see that Sky 6 (R p ) = {O 7, O 5, O 2, O 6, O 3 }, Sky 5 (R p ) = {O 7, O 5 }, and Sky 4 (R p ) = {O 7 }. Since there is no object whose DC value is less than 3, thus Sky 3 (R p ) = { }. 4.3 Spatial k-dominant Skyline Maintenance In this section, we discuss the maintenance problem of Sky k (R p ) after an update has occurred in R p. In the maintenance phase, we keep a vector that contains the minimal value for each dimension, which we call the minimal vector. The minimal vector of Table 8 is {1, 2, 1, 1, 1, 2}. We also keep an inverted index of the dominant index column (IDX). Table 9 is the inverted index of Table 8. Lemma 1: Assume O 1.D s O 2.D s for all dimensions D s (s = 1,, n) for O 1, O 2 R p. If O 1 is not deleted from R p, O 2 will never be in the k-dominant skyline of R p. Proof: Since O 2 is n-dominated by O 1, it is also k-dominated by O 1 because (k n). Therefore, while O 2 is in the R p, there will be at least one object, namely O 1, that k-dominates it and consequently cannot become a k-dominant skyline. It implies that only Sky n (R p ) objects are relevant for the k-dominant skyline maintenance task. 254

12 International Journal of Networking and Computing Table 9: Inverted Index Obj. dominated O 5 O 7 O 7 O 5, O 4, O 2, O 6, O 3, O 1 Table 1: Ordered Domination Table after Insertions Obj. DP Sum DC IDX Obj. DP Sum DC IDX O O 5 O O 5 O O 7 O O 7 O O 7 O O 7 O O 7 O O 7 O O 7 O O 7 O O 2 O O 7 O O 7 O O 1 O O 7 O O 7 O O 7 O O 7 O O 7 Insertion Assume O I is inserted into R p. We maintain Sky k (R p ) by examining the dominated counter (DC) by scanning the sorted object sequence. If the DC value of an object is updated, we also update the inverted index. During the update procedure, if O I is n-dominated by another object, then we can stop the procedure immediately. After updating DC and IDX, we insert O I into the sorted object sequence. We use a skip list data structure for the sequence so that we can insert efficiently. Assume we insert O 8 = (6, 6, 6, 6, 6, 6) in the running example. By comparing with the first object O 7, we can find that O 8 is 6 dominated by O 7. Therefore, we immediately complete the update of DC and IDX. Then, we insert O 8 into the last position of the sorted sequence. Next, we insert O 9 = (3, 4, 4, 2, 2, 3). O 9 is not 6 dominated by any object and we can find that the strongest dominator of O 9 is O 2. Then, DC and IDX are updated as in Table 1 after inserting O 9. Next, we insert O 1 = (2, 2, 3, 1, 2, 3). O 1 is not 6-dominated by any object and the strongest dominator of O 1 is O 7. Moreover, O 1 becomes the strongest dominator of O 9. Then, DC and IDX are updated as in Table 1 after inserting O 1. Deletion Assume O D is deleted from R p. We check the inverted index to examine whether there is O D s record in the index. Note that in Table 9 Obj. column in each record is the strongest dominator for objects in the dominated column. Therefore, if there is no O D s record in the inverted index, we do not have to change the dominated counter (DC) for other objects. Otherwise, we initialize DC for each object in the O D s index record. Then, we perform the pairwise comparison. In this case, we do not have to examine DC for objects that are not in the O D s index record and therefore it is not costly. Consider the running example again. Assume we delete O 1. We examine DC of O 9 by scanning Table 1. The updated result is Table 1. Update In this section, we propose maintenance solution for update operation. Although one can handle update operation with consecutive deletion and insertion. However, single update operation is usually faster than consecutive delete and insert operation. Assume O U is updated from R p. Then, 255

13 Distributed Spatial k-dominant Skyline Maintenance Using Computational Object Preservation Table 11: Ordered Domination Table after Updates Obj. DP Sum DC IDX Obj. DP Sum DC IDX O O 5 O O 4 O O 7 O O 7 O O 7 O O 5 O O 7 O O 5 O O 7 O O 4 O O 3 O O 5 O O 7 O O 4 Table 12: Response time for heterogeneous cores Core Type Time (ms) Single Core Core Core i Core i Core i same as delete operation, we have to check the inverted index to examine whether there is O U s record in the index. If there is no O U s record in the inverted index, then we need one scan to revise the dominated counter (DC) for updated object as well as all other data objects. Otherwise, we initialize DC for each object from scratch. However, in this situation we argue that compared with non-idx objects, the number of IDX objects is very few and therefore update operation is also not very costly. Consider Table 8 and select a non-idx object such as O 3 for update. Assume we update D 5 value of this object from 5 to 1. Then after dominance checking with other objects we find that O 3 becomes the strongest dominator of object O 6. This update procedure is shown in Table 11. Again from Table 8 if we select an IDX object such as O 7. In this case, we update the values of D 1 and D 5 from 1 to 6. The revised result after this update is shown in Table Performance Evaluation We conduct a series of experiments to evaluate the effectiveness and efficiency of our proposed methods. There are no other existing techniques that deal with the problem of maintaining k- dominant skyline. So in this paper, we compare our methods against M SKS, which was proposed in [16]. Our heterogeneous simulation takes five different cores (PCs) that are connected to a coordinator (server). Processors of those cores are as follows: single core, core2, core i3, core i5, and core i7. They have 3GB main memory. We choose independent distribution with 6 dimensionality and 1k data cardinality to evaluate each core performance. Table 12 shows the computation time of each core. We observe that high power computing cores take less time than those of low power cores. This result establishes the necessity of rank power computation. Next, our homogeneous simulation takes similar type of cores. The number of cores is moderately small (from 2 to 8). Each of the cores has an Intel(R) Core2 Duo 2 GHz CPU and 3GB main memory. We evaluate our algorithms against M SKS using different types of datasets. As benchmark, we use the datasets proposed by Borzsony et al. [2], in which there are three types of synthetic data distributions: correlated, anti-correlated, and independent. The generation of the synthetic datasets is controlled by three parameters, n, Size, and Dist, where n is the number of attributes, Size is the total number of objects in the dataset, and Dist can be any of the three distributions. In addition, we have generated smaller synthetic datasets for all insertion experiments. For example to conduct insertion experiment on 1k synthetic dataset, we have also generated additional 1k dataset. As for delete and update operations, we choose random object from the experimental dataset. 256

14 International Journal of Networking and Computing 5.1 Spatial k-dominant Computation Varying Core Number We first study the effect of spatial k-dominant computation varying core number on our techniques. We use anti-correlated, independent, and correlated datasets with dimensionality n to 7, cardinality to 1k, and k to 6. The number of cores varies from 2 to 8. Figure 3(a), (b), and (c) shows the time to compute spatial k-dominant skyline. Figure 3 shows that the performance of proposed methods outperform than M SKS for all data distribution. 5.2 Effect of Data Distribution To study the effect of data distributions we use anti-correlated, independent, and correlated datasets with dimensionality n to 7, cardinality to 1k, k to 6, and core number to 8. figure 4(a), (b), and (c) shows the time to maintain spatial k-dominant skyline for the maintenance ranges from 1% to 5%. In this experiment, 1% maintenance for 1k dataset implies that there are 333 insertions, 333 deletions, and 334 updates (total 1k updates) occurred in the dataset. Figure 4 shows that the performance of both methods deteriorates significantly with the increase of maintenance size. We further notice that for anti-correlated dataset, many spatial k-dominant skyline objects retrieved as a result the maintenance cost of this distribution incurs high computational overheads. 5.3 Effect of Dimensionality For this experiment, we choose the anti-correlated datasets and the core number is 8. We set the data cardinality to 1k and vary dataset dimensionality n ranges from 4 to 8 and k from 3 to 7. Figure 5(a), (b), and (c) shows the update performance. The M SKS technique is highly affected by the curse of dimensionality, i.e., as the space becomes sparser its response time rapidly decreases. Figure 5 shows that if the dimensionality and the maintenance ratio increase the response time of proposed methods grows steadily, which is much less than that of. 5.4 Effect of Cardinality For this experiment, we select the anti-correlated datasets with varying dataset cardinality ranges from 1k to 2k, set the value of n to 7, k to 6, and core number to 8. Figure 6(a), (b), and (c) shows the time to maintain k-dominant skyline for maintenance ranges from 1% to 5%. The result shows that if the maintenance ratio and data cardinality increase the response time of proposed methods also increases. However, the response time is much smaller than that of M SKS. 6 Conclusion k-dominant skyline has side effect of taking larger time for centralized computation. Centralized system also has maintenance problem. To solve those problems, we consider regionally distributed k-dominant skyline and present an efficient architecture for computing the query result as well as the maintenance problem. We introduced registration and ranking power calculation, job allocation, and computation core selection algorithms for efficient load distribution and shortening the response time. Our algorithms consistently outperform against M SKS algorithm. Intensive experiments show the efficiency and scalability of our proposed methods. However, current architecture can handle small number of cores. We leave the work to support large number of cores for future research. References [1] T. Xia, D. Zhang, and Y. Tao, On Skylining with Flexible Dominance Relation, in Proceedings of ICDE, 28, pp

15 Distributed Spatial k-dominant Skyline Maintenance Using Computational Object Preservation Core Number a) Correlated Core Number b) Anti-correlated Core Number c) Independent Figure 3: Computation performance for different core number 258

16 International Journal of Networking and Computing % 2% 3% 4% 5% Maintenance a) Correlated % 2% 3% 4% 5% Maintenance b) Anti-correlated 1% 2% 3% 4% 5% Maintenance c) Independent Figure 4: Maintenance performance for different data distributions 259

17 Distributed Spatial k-dominant Skyline Maintenance Using Computational Object Preservation % 2% 3% 4% 5% Maintenance a) 4 dimension % 2% 3% 4% 5% Maintenance b) 6 dimension % 2% 3% 4% 5% Maintenance c) 8 dimension Figure 5: Maintenance performance for different dimensions 26

18 International Journal of Networking and Computing % 2% 3% 4% Maintenance a) 1k % 2% 3% 4% 5% Maintenance b) 15k % 2% 3% 4% 5% Maintenance c) 2k Figure 6: Maintenance performance for different data cardinality 261

19 Distributed Spatial k-dominant Skyline Maintenance Using Computational Object Preservation [2] S. Borzsonyi, D. Kossmann, and K. Stocker, The skyline operator, in Proceedings of ICDE, 21, pp [3] D. Kossmann, F. Ramsak, and S. Rost, Shooting stars in the sky: An online algorithm for skyline queries, in Proceedings of VLDB, 22, pp [4] J. Chomicki, P. Godfrey, J. Gryz, and D. Liang, Skyline with Presorting, in Proceedings of ICDE, 23, pp [5] D. Papadias, Y. Tao, G. Fu, and B. Seeger, Progressive skyline computation in database systems, in ACM Transactions on Database Systems, vol. 3(1), pp , March 25. [6] C. Y. Chan, H. V. Jagadish, K-L. Tan, A-K. H. Tung, and Z. Zhang, Finding k-dominant Skyline in High Dimensional Space, in Proceedings of ACM SIGMOD, 26, pp [7] K.-L. Tan, P.-K. Eng, and B. C. Ooi, Efficient Progressive Skyline Computation, in Proceedings of VLDB, 21, pp [8] R. Fagin, A. Lotem, and M. Naor, Optimal aggregation algorithms for middleware, in Proceedings of ACM PODS, 21, pp [9] A. Vlachou, C. Doulkeridis, Y. Kotidis, and M. Vazirgiannis, SKYPEER: Efficient Subspace Skyline Computation over Distributed Data, in Proceedings of ICDE, 27, pp [1] K. Fotiadou and E. Pitoura, BITPEER: Continuous Subspace Skyline Computation with Distributed Bitmap Indexes, in Proceedings of DaAMaP, 28, pp [11] C. Y. Chan, H. V. Jagadish, K-L. Tan, A-K. H. Tung, and Z. Zhang, On High Dimensional Skylines, in Proceedings of EDBT, 26, pp [12] Y. Tao, X. Xiao, and J. Pei, Subsky: Efficient Computation of Skylines in Subspaces, in Proceedings of ICDE, 26, pp [13] W.-T Balke, U. Guntzer, and J. X. Zheng, Efficient Distributed Skylining for Web Information Systems, in Proceedings of EDBT, 24, pp [14] Z. Huang, C. S. Jensen, H. Lu, and B. C. Ooi, Skyline Queries Against Mobile Lightweight Devices in Manets, in Proceedings of ICDE, 26, pp. 66. [15] M. A. Siddique and Y. Morimoto, Efficient Maintenance of all k-dominant Skyline Query Results for Frequently Updated Database, in The International Journal on Advances in Software, Vol. 3, No. 3 & 4, 21, pp [16] M. A. Siddique, A. Zaman, M. M. Islam, and Y. Morimoto, Multicore based Spatial k- dominant Skyline Computation, in Proceedings of the 3rd Int l Conference on Networking and Computing (ICNC), 212, pp [17] W.-T. Balke, U. Guntzer, and J. X. Zheng, Efficient distributed skylining for web information systems, in Proceedings of EDBT, 24, pp [18] L. Chen, B. Cui, H. Lu, L. Xu, and Q. Xu, isky: Efficient and progressive skyline computing in a structured P2P network, in Proceedings of ICDCS, 28, pp [19] B. Cui, H. Lu, Q. Xu, L. Chen, Y. Dai, and Y. Zhou, Parallel distributed processing of constrained skyline queries by filtering, in Proceedings of ICDE, 28, pp [2] K. Fotiadou and E. Pitoura, BITPEER: Continuous subspace skyline computation with distributed bitmap indexes, in Proceedings of DAMAP, 28, pp [21] K. Hose, C. Lemke, and K.-U. Sattler, Processing relaxed skylines in PDMS using distributed data summaries, in Proceedings of CIKM, 26, pp

Multicore based Spatial k-dominant Skyline Computation

2012 Third International Conference on Networking and Computing Multicore based Spatial k-dominant Skyline Computation Md. Anisuzzaman Siddique, Asif Zaman, Md. Mahbubul Islam Computer Science and Engineering