Distributing the Derivation and Maintenance of Subset Descriptor Rules
Jerome Robinson, Barry G. T. Lowden, Mohammed Al Haddad
Department of Computer Science, University of Essex, Colchester, Essex, CO4 3SQ, U.K.

Abstract

This draft paper describes a solution to the rule maintenance problem for data descriptor rules derived from data that may subsequently change. The method utilises any available computers in the local area network to derive and maintain rule sets.

1. Introduction

Database query processing involves the selection and manipulation of data subsets specified by the query or by the query processor. Descriptors for data subsets are useful in optimising the query processing task. For example, histograms are simple subset descriptors: each bar in the histogram describes a data subset by specifying the number of data items in that subset. This descriptor information is used in conventional query optimisation to schedule the order of operations on intermediate data sets in the query execution plan. Attribute Pair rules [14] are subset descriptors which state dependencies between data values within subsets. Rules of this kind are the basis of semantic query optimisation [13, 17, 18] and can also be used to support data caching in remote clients of a database management system [20].

The problem for applications using subset descriptors is that any change to the data may require a corresponding change to the description of one or more of the data subsets. This implies the need for fast derivation and maintenance of subset descriptors, in a way that does not add workload to the database server. We investigate the use of multiple workstations, in the same local area network as the data server, to handle the work of descriptor derivation and maintenance. Tasks are distributed to these computers by a particular workstation (the master in a master-slave configuration), and run on each slave workstation as background programs.
2. Background Information

A subset descriptor is a (selector, descriptor) pair. The selector is a data-value constraint (a selection condition) which identifies a subset of data items in the database; the descriptor provides information about the data items in that subset. The selector uses a collection of values and ranges for specified attributes as a selection constraint to include tuples in the subset. It may be the Boolean expression in the WHERE clause of an SQL query, for example.

Attribute Pair (AP) rules [14] have been identified as a particularly useful form of subset descriptor for semantic query optimisation (SQO) and remote cache management. An AP rule has the form A → B, where A and B both have the structure of query selection conditions, but the consequent B also describes the tuples selected by condition A. For example, c(10..20) → d(27..36) means: in the set of tuples which satisfy the condition (10 ≤ c < 20) for attribute c, all tuples have attribute d values in the range 27..36.

This apparently simple rule structure hides detail, since an AP rule set refers to a particular database table, which may be a virtual relation, containing the pair of attributes. For example, the rule:

    ship_class(class, type, draught, _, _) ∧ ship(_, class, type, status, _, _) ∧ (draught < 50) → (status = 'Active')
is the AP rule (draught < 50) → (status = 'Active') on a virtual relation which is the natural join of the base relations ship_class and ship. Since a whole set of such AP rules is associated with the table, it is inappropriate to repeat the join information in every rule.

Sets of AP rule subset descriptors are derived from the data in preparation for query processing. Each AP rule is an ordered pair of conditions, which allows rules to be used as directed edges in a graph [13] whose paths provide transitive inferences: further descriptions of the subset selected by the condition at the start of the sub-path. AP rules can be extended to multi-consequent rules [13] without losing this graph-edge semantics. E.g. the descriptor:

    c(10..20) → d(27..36), (f ∈ {'OMG', 'TPC'}), h(…)

represents three AP rules, all with the same antecedent condition, c(10..20). A single antecedent look-up in the rule set thus provides multiple assertions about the selected subset. Reducing look-up time for descriptors is important in query optimisation, since the goal is reduced query execution time. The consequent is a vector of assertions, so that descriptors for different subsets are easily compared or combined by specific vector element. For example, a database query selects data items with a(75..90) AND c(10..15), and two relevant descriptors exist which describe subsets containing the required data items:

    a(70..95) → d(18..43), (f ∈ {'ODMG', 'OMG'}), h(13..71)
    c(10..20) → d(27..36), (f ∈ {'OMG', 'TPC'}), h(…)

The query conditions are sub-ranges of the antecedent conditions, so each antecedent selects a superset of the set selected by the corresponding query condition. Furthermore, the conjunction in the query specifies the intersection of the two sets described by the two rules. Pairwise comparison of vector elements in the two rule consequents shows incompatible values for attribute h. This indicates that the two query conditions select disjoint subsets of data items.
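The element-wise consequent comparison can be sketched as follows. This is a minimal illustration, not the authors' implementation, and the h-range of the second descriptor is an assumed value, since the source elides it:

```python
def ranges_disjoint(r1, r2):
    """True if two closed ranges (lo, hi) cannot share any value."""
    return r1[1] < r2[0] or r2[1] < r1[0]

def assertions_incompatible(a, b):
    # Numeric range vs numeric range: incompatible when disjoint
    if isinstance(a, tuple) and isinstance(b, tuple):
        return ranges_disjoint(a, b)
    # Set-valued assertions: incompatible when they share no member
    if isinstance(a, set) and isinstance(b, set):
        return not (a & b)
    return False

def query_result_empty(cons1, cons2):
    """Compare consequent vectors element-wise; any incompatible pair
    proves the two query conditions select disjoint tuple sets."""
    shared = set(cons1) & set(cons2)
    return any(assertions_incompatible(cons1[k], cons2[k]) for k in shared)

# The two descriptors from the text (h-range of the second rule assumed):
rule_a = {"d": (18, 43), "f": {"ODMG", "OMG"}, "h": (13, 71)}
rule_c = {"d": (27, 36), "f": {"OMG", "TPC"}, "h": (80, 95)}  # assumed h-range

print(query_result_empty(rule_a, rule_c))  # True: the h-ranges are disjoint
```

Only one incompatible vector element is needed: the overlapping d-ranges and the shared f-member do not help, but the disjoint h-ranges settle the question.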
No tuples can satisfy both query conditions, so the result set will be empty, and the empty answer can be returned immediately, without consulting the database.

In 'Associative Caching' [4, 5, 8] each client computer keeps a copy of each of its query result sets in its own local database. The purpose is to reduce the size and frequency of queries to the remote data server accessed by wide-area network. This can reduce query cost factors based on access restrictions imposed by the server, such as authorisation delays, payment charges for data, and server breakdown or workload delay, as well as internet delays. For each new query the client tries to find some or all of the required data in its local collection of result sets. Usually this is done by syntactic comparison of the new query with each previous query [e.g. 2, 11] to detect overlapping data sets. Attribute-Pair Range Rules, which the server derives from its data for its own use in query optimisation, can be further utilised to provide descriptors for each query result set. This new information adds to the limited description currently available to a client in the form of the previous query expression. It enables clients to recognise data overlap for new queries which refer to attributes not mentioned in the previous query, so that local data can now be exploited for syntactically unrelated queries [15].

Subset descriptors are a form of knowledge about the data, derived directly from the data itself. But unlike many forms of KDD it must be exact [10] rather than probability-based. This means it cannot use only samples of the data: it must process all tuples in the subset it describes. The use of subset descriptors can therefore introduce a significant processing workload. But the data server should not be required to do extra work of this kind, since it may delay current queries: creating metadata to make future queries faster would make current queries slower.
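Returning to the caching use: the client-side test for answering a new query from cached result sets can be sketched as below. This shows only the simple full-containment case (every query range lying inside a cached descriptor's antecedent), and all names and data are illustrative:

```python
def subrange(query_cond, cached_cond):
    """(lo, hi) query range lies entirely inside the cached range."""
    return cached_cond[0] <= query_cond[0] and query_cond[1] <= cached_cond[1]

def answerable_from_cache(query, cached_descriptors):
    """Return the cached result sets whose descriptor antecedents cover
    every range condition of the new query."""
    hits = []
    for name, antecedents in cached_descriptors.items():
        if all(attr in antecedents and subrange(rng, antecedents[attr])
               for attr, rng in query.items()):
            hits.append(name)
    return hits

# Two cached result sets, described by ranges on their antecedent attributes:
cache = {
    "result_1": {"a": (70, 95), "c": (10, 20)},
    "result_2": {"a": (0, 50)},
}
print(answerable_from_cache({"a": (75, 90), "c": (10, 15)}, cache))  # ['result_1']
```

A production client would also handle partial overlap, fetching only the missing remainder from the server; this sketch covers the fully answerable case only.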
Furthermore, data may change (in environments other than the static data warehouse or data archive), and this requires a corresponding change to descriptors. So it would be useful to find an existing hardware resource that can be used to do this work instead of the data server.

3. Creating a Parallel Virtual Machine from Networked Computers

PVM [6] is a well-established software system which enables a group of workstations linked by local-area network to work together as a Virtual Machine. Modern workstations have more computing power than they use, and successive generations of workstations increase on the computing power and capacity of previous generations. The amount of spare computing capability in a network of desktop machines is therefore steadily increasing. This is a resource that can be used to analyse and summarise data sets.

PVM allows networked workstations to be used by spawning new background programs on the machines. The programs can accept messages from a main process/program on a particular computer telling them what to do, and can return the specified results. Data sets can be transferred directly between machines, or via the Network File System. This allows the data server to be treated as one component in a multi-workstation machine. The processing and memory resources of the machine can expand and contract dynamically by varying the number of computers being used. The task of deriving subset descriptors from data can be distributed to multiple workstations in the local area network, as follows.

    Main Process:
      Identify table and relevant attributes
      Retrieve the database table data
      Establish the PVM machine
      Send the same amount of data to each slave
      Wait for the rules
      Receive and merge the rule subsets

    Sub-processes on different computers:
      Receive the workload from the main process
      Sort the records according to a specific attribute
      Derive the rules
      Send the derived rules to the main process

Fig. 1 Using multiple workstations to create a set of subset descriptor rules
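PVM itself is a C/Fortran message-passing library, so the sketch below (not the authors' code) stands in a Python thread pool for the spawned slave programs. It mimics the flow of Fig. 1: the master scatters equal shares of the table, each slave sorts its share on the antecedent attribute and derives one rule per sub-range, and the master merges the returned rule subsets. The derivation and merge steps are detailed in section 4; all names and sample data are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def slave(rows, lo, hi, n_rules):
    """Slave-side work: sort on the antecedent (first field), then scan
    once, widening the consequent range of each sub-range's rule."""
    width = (hi - lo) / n_rules
    rules = {}
    for a, c in sorted(rows):
        i = min(int((a - lo) / width), n_rules - 1)    # sub-range index
        key = (lo + i * width, lo + (i + 1) * width)
        mn, mx = rules.get(key, (c, c))
        rules[key] = (min(mn, c), max(mx, c))          # widen only if needed
    return rules

def master(table, n_slaves, n_rules):
    lo = min(a for a, _ in table)
    hi = max(a for a, _ in table) + 1                  # MIN..MAX, broadcast to all
    shares = [table[i::n_slaves] for i in range(n_slaves)]  # equal shares
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        partial = list(pool.map(lambda s: slave(s, lo, hi, n_rules), shares))
    merged = {}
    for rules in partial:                              # merge corresponding rules
        for key, (mn, mx) in rules.items():
            cur = merged.get(key, (mn, mx))
            merged[key] = (min(cur[0], mn), max(cur[1], mx))
    return merged

table = [(12, 26), (30, 55), (14, 31), (44, 61), (18, 29), (33, 58)]
print(master(table, n_slaves=2, n_rules=2))
# {(12.0, 28.5): (26, 31), (28.5, 45.0): (55, 61)}
```

The merged rule set is identical to what a single machine would derive from the whole table, which is what makes the partitioning transparent to rule consumers.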
4. The Rule Derivation Algorithm

1. The Main Process chooses an attribute to be the antecedent for the current set of rules, and identifies the MIN and MAX values if it is numeric. It broadcasts those values to all computers, and the message also specifies the number of rules required in the set.

2. Each computer then receives from the Main Process a subset of the database table to be described, and sorts it on the attribute specified as antecedent.

3. After sorting, each computer divides the MIN..MAX range into the specified number of sub-ranges. This is the number of rules required, since each sub-range produces a rule.

4. Each computer divides its sorted table (part of the original database table) into disjoint subsets, using the sub-ranges to select tuples. The ordered sequence of tuples is scanned, building each rule incrementally. For example, if the next sub-range is 10 ≤ a < 25 for the antecedent attribute named a, then all tuples in the relevant sub-sequence of tuples will contribute to the rule. If the first tuple in the sub-set has 26 as the value of attribute c, then the rule so far is (10 ≤ a < 25) → (c = 26). Descriptors for other consequent attributes are added in the same way. The next tuple in the ordered sequence has c = 31, so the rule describing all tuples encountered so far becomes (10 ≤ a < 25) → (26 ≤ c ≤ 31). If the next tuple has c = 29 then the rule remains unchanged, because it correctly describes the set of three tuples which includes this new tuple. Thus each new tuple encountered during the scan through the ordered table will either extend the consequent range or leave it unchanged, so that when no more tuples satisfy the selection condition (10 ≤ a < 25) the rule describes all tuples in that sub-set. The next tuple in the sorted data sequence starts a new descriptor for the next sub-set, with antecedent (25 ≤ a < 40), for example. When the end of the sorted table is reached, the computer has produced the specified number of sub-range descriptor rules.

5. Each computer returns its set of rules to the Main Process, which merges corresponding rules from all the separate computers to create a single rule set with the specified number of rules. This rule set describes the whole database table. Corresponding rules are rules with the same antecedent condition, produced on separate computers. Rule merging is just another stage of incremental rule generation. For example, the rules (40 ≤ a < 55) → (61 ≤ c ≤ 83) and (40 ≤ a < 55) → (68 ≤ c ≤ 74) are provided by two computers. The combined rule is (40 ≤ a < 55) → (61 ≤ c ≤ 83), since this describes both sets. If a further computer provides the rule (40 ≤ a < 55) → (75 ≤ c ≤ 85), the descriptor for the union of the three tuple sub-sets is (40 ≤ a < 55) → (61 ≤ c ≤ 85). Another computer returns (40 ≤ a < 55) → no tuples, so the rule remains (40 ≤ a < 55) → (61 ≤ c ≤ 85).

5. Performance of the Multi-computer Rule Derivation Algorithm

The elapsed time for multi-computer rule derivation has been measured in experiments. The following graph shows a typical example of the experimental results obtained. It shows the measured times to derive rules from a table of 112-byte rows, distributed to varying numbers of networked workstations. The attribute used as antecedent for the derived rules was of character string type, which is much slower to sort than a numeric attribute. Although measured times for numeric antecedent attributes are much shorter, the shape of the graph is very similar, indicating a rapid reduction in time as the number of computers used increases. This is the time needed to derive a set of rules from a database table. The rule set is like a histogram, with an Attribute Pair rule or multi-consequent rule describing the subset represented by each bar of the histogram.
Fig. 2 Observed performance of multi-computer rule derivation: measured time (in seconds) to derive rules from tuples whose antecedent attribute is of string type, plotted against the number of computers used in the local network, with the expected time 625/H shown for comparison

Total time is significantly reduced by working with multiple computers. But the time reduction is also remarkable in being better than one might predict. Dividing work between three workers can at best divide the total time by three; indeed the additional work to distribute data and synchronise the workers may prevent even the theoretical speedup of T/H, where T is the time for a single worker and H is the number of workers. The graph plots values of T/H for comparison with the measured times; T was 625 seconds. For two or more computers, the elapsed time was found to be shorter than T/H. This 14 MByte example is typical of results from experiments on data sets of various sizes and data types: better than T/H performance was observed for all of them.

Several factors contribute to this speedup. One is the N·logN complexity of the Quicksort algorithm, which consumes most of the elapsed time in the rule derivation process. If the elapsed time T to sort a set of N data items is proportional to N·logN, then T = (1/k)·N·logN, where 1/k is the constant of proportionality. Fitting this to the measured single-machine time of T = 625 seconds gives k ≈ 1066, so values of T can be predicted as (1/1066)·N·logN, where N is the number of tuples per machine, i.e. the table size divided by H, the number of workstations. But the observed times are still significantly faster than these predicted times. The following graph indicates the connection between T/H and N·logN as the size of the data sub-set in each machine decreases, as the data set is partitioned between more computers.
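The fitted model can be reproduced numerically. The sketch below assumes a total row count, since the exact value is elided in the source, and calibrates k so that the model matches the single-machine measurement; it then shows that (1/k)·N·logN shrinks faster than the linear T/H line, which is the superlinear effect described above.

```python
import math

T = 625.0           # measured single-machine elapsed time, seconds (from the text)
N_total = 100_000   # ASSUMED total row count; the exact value is elided in the source

# Calibrate T = (1/k) * N * log N against the single-machine measurement
k = N_total * math.log2(N_total) / T

def predicted_sort_time(H):
    """Model time for the N_total/H tuples sorted on each of H machines."""
    n = N_total / H
    return n * math.log2(n) / k

for H in (1, 2, 4, 8):
    # N log N shrinks faster than linearly, so the model beats T/H for H > 1
    print(H, round(T / H, 1), round(predicted_sort_time(H), 1))
```

Whatever row count is assumed, the ratio predicted/(T/H) equals log(N/H)/log(N), which is below 1 for H > 1: the model always lies under the T/H line.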
Fig. 3 Comparison of N·logN values with T/H, as the number of computers H increases (N is the number of data items per computer, i.e. the table size divided by H)

A second factor, which contributes to the large speedup when distributing the sort algorithm, is the amount of paging required as the size of the data set to be sorted increases. The proportion of pages which cause page faults, requiring swapping from disk, increases with the amount by which the data set exceeds the available main memory space. Each disk access is a severe time penalty, so the smaller data sets produced by division among more machines reduce the number of these delays.

A third factor, related to available main memory space and paging, is the data transfer time when sending large data subsets to the computers that sort them. Message passing is used between computers. The receive buffer in PVM message passing is limited by the amount of main memory available to use dynamically as buffer space. Blocking send is used to transfer data reliably, so delays can occur when the amount of data exceeds the amount of physical memory space: paging to virtual memory must occur before physical memory frames become available as buffer space to accept more data. This delay does not occur when the number of computers used is great enough (depending on the size of the whole data set).

6. Rule Maintenance

If the data changes, rules describing the data may need to change. Insert, Delete and Update are the database operations that can change the data.

A tuple INSERT has the same effect on descriptors as a new tuple encountered during the table scan described in section 4. The numeric or string value of the antecedent attribute in the new tuple maps to the relevant rule. Assertions in that rule describing other attributes may need to be extended by values in the new tuple. If several sets of rules exist, each with a different antecedent attribute, then the new tuple maps to one rule in each set.
Deleting a tuple does not require any change to range assertion rules, since deletion does not falsify them: any remaining data values are still within the consequent-specified range limits. However, choosing to create new descriptors for any rule whose antecedent includes attribute values in the deleted tuple may provide narrower ranges as consequent assertions. This is beneficial because narrower consequent ranges can match more query conditions.
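The effect of Insert and Delete on a single range rule can be sketched as follows (a minimal illustration, not the authors' code; an Update is then handled as the delete step followed by the insert step, as described next). Values are illustrative:

```python
def apply_insert(rule, c_value):
    """INSERT: widen the consequent (lo, hi) just enough to cover the
    new tuple's value; if the value already fits, nothing changes."""
    lo, hi = rule
    return (min(lo, c_value), max(hi, c_value))

def rederive_after_delete(remaining_c_values):
    """DELETE never falsifies the rule, but optional re-derivation from
    the remaining tuples may yield a narrower (more useful) range."""
    return (min(remaining_c_values), max(remaining_c_values))

rule = (29, 71)                # current merged rule a(15..20) -> c(29..71)
rule = apply_insert(rule, 91)  # a slave reports a new tuple with c = 91
print(rule)                    # (29, 91)

print(rederive_after_delete([34, 40, 52]))  # possibly narrower: (34, 52)
```

Note the asymmetry: an insert must be applied immediately (the old range may now be false), while re-derivation after a delete is optional and can be deferred.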
Updating a single tuple changes the value of one or more fields in an existing tuple, and is equivalent to reading and Deleting the tuple before Inserting the new version. Rule maintenance actions are therefore the same as for a Delete followed by an Insert. However, if an Update changes a field in all tuples in the table, the server will disable all assertions about that column of the table until they can all be revised. This makes one of the elements in some rule consequent vectors temporarily unavailable.

When a new tuple is Inserted into the database table, it is also sent to one of the computers, to add to its data subset. As a result of this new data the computer may notify the Main Process that one of its subset descriptor rules has changed. For example, a(15..20) → c(63..91) is a revised rule produced by the computer. To merge this with the existing rule set, the current rule a(15..20) → c(29..71), which was previously produced by merging results from all computers, becomes a(15..20) → c(29..91).

If a tuple is Updated in the central database table, the old version of the tuple is broadcast to all computers, so that the machine holding a matching tuple can delete it before Inserting the new version. After the delete, n of the rules can be revised in the affected computer, where n attributes were updated. It then notifies the Main Process that an improved version of that particular subset descriptor is available, and the Main Process examines the corresponding rule from all other computers in order to create a new merged descriptor for that subset. The master computer retains all the rule sub-sets created in all the slaves, for use in this incremental rule maintenance process.

7. Conclusions

Converting a database table to a set of subset descriptor rules is a data reduction process, because the rule set is much smaller than its data set: the descriptors provide a summary of the data.
Partitioning a data set and then merging the rule sets derived from the partitions is found to be an effective way to speed up the creation of rule sets. A sorting algorithm was used to get the data subsets into a structure (a set of sorted sequences) which can be used as a look-up table, both to derive rules rapidly and to update those rules when the data changes. Merging rule sets from a collection of workstations is very fast, much faster than merging the sorted data subsets themselves.

The configuration of a 'master' workstation with a set of 'slave' workstations in a local area network provides an effective way to solve the problem of maintaining derived descriptor rules as the data changes. The master workstation is also the user interface to the database, accepting queries and data updates from networked users. It sends all data changes to the slave workstations as well as to the data server, and the slaves respond with any changes caused to their rule subsets. The workload of rule derivation and maintenance does not affect the data server, because it is done on different computers.

Workstations in a local network are commonly underutilised. Their computing capacity is rarely used to its full extent, because modern desktop computers are powerful machines but typical application programs have a use profile that leaves the machines virtually idle for most of the time, with occasional bursts of activity. We utilise such networked workstations as a distributed computing resource, to derive and maintain data descriptor rules by means of background programs on the workstations.

The master workstation uses the derived rules for semantic query optimisation [13, 16, 18] and for remote client cache management [20], but it can also answer queries from the sorted data in the slave workstations as well as from the data server. This method of query optimisation, generating query execution plans that use the workstation data sets as well as the database data server, is the subject of our current research.
References

1. S. Abiteboul, R. Hull and V. Vianu: Foundations of Databases. Addison-Wesley, 1995.
2. S. Adali, K. S. Candan, Y. Papakonstantinou, V. S. Subrahmanian: Query Caching and Optimization in Distributed Mediator Systems. Proc. ACM SIGMOD Conf. (1996).
3. J. Basu, M. Poess, A. M. Keller: Performance Analysis of an Associative Caching Scheme for Client-Server Databases. Technical Note STAN-CS-TN-97-61, Stanford University, Computer Science Dept., September 1997.
4. J. Basu, M. Poess, A. M. Keller: High Performance and Scalability Through Associative Client-Side Caching. 7th International Workshop on High Performance Transaction Systems, Pacific Grove, CA, September 1997.
5. S. Dar, M. J. Franklin, B. T. Jonsson, D. Srivastava, M. Tan: Semantic Data Caching and Replacement. Proc. 22nd VLDB Conference (1996).
6. A. Geist et al.: PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.
7. P. Godfrey, J. Gryz: Semantic Query Caching for Heterogeneous Databases. KRDB'97, 4th International Workshop on Knowledge Representation meets Data Bases (1997).
8. A. M. Keller, J. Basu: A Predicate-based Caching Scheme for Client-Server Database Architectures. VLDB Journal 5(1), 1996.
10. G. Piatetsky-Shapiro: Discovery, Analysis and Presentation of Strong Rules. In Knowledge Discovery in Databases, eds. G. Piatetsky-Shapiro and W. J. Frawley, MIT Press (1991).
11. X. Qian: Query Folding. 12th IEEE Intl. Conference on Data Engineering (1996).
12. J. Robinson, B. G. T. Lowden: Data Analysis for Query Processing. 2nd Intl. Symposium on Intelligent Data Analysis (1997) (LNCS 1280).
13. J. Robinson, B. G. T. Lowden: Semantic Query Optimisation and Rule Graphs. KRDB'98, 5th International Workshop on Knowledge Representation meets Data Bases (1998).
14. J. Robinson, B. G. T. Lowden: Attribute-Pair Range Rules. Proc. DEXA'98, 9th Intl. Conference on Database and Expert Systems Applications (1998) (LNCS 1460).
16. S. Shekhar, B. Hamidzadeh, A. Kohli, M. Coyle: Learning transformation rules for semantic query optimization: A data-driven approach. IEEE Transactions on Knowledge and Data Engineering 5(6), 1993.
17. S. T. Shenoy, Z. M. Ozsoyoglu: A System for Semantic Query Optimization. Proc. ACM SIGMOD Conference, 1987.
18. M. Siegel, E. Sciore, S. Salveter: A Method for Automatic Rule Derivation to Support Semantic Query Optimization. ACM TODS 17(4), 1992.
19. D. Srivastava, S. Dar, H. V. Jagadish, A. Y. Levy: Answering Queries with Aggregation Using Views. Proc. 22nd VLDB Conference (1996).
20. J. Robinson, B. G. T. Lowden: Extending the Re-use of Query Results at Remote Client Sites. Proc. DEXA'00, 11th Intl. Conf. on Database and Expert Systems Applications, Springer, 2000 (LNCS 1873).
More informationA Fast Method for Ensuring the Consistency of Integrity Constraints
A Fast Method for Ensuring the Consistency of Integrity Constraints Barry G. T. Lowden and Jerome Robinson Department of Computer Science, The University of Essex, Wivenhoe Park, Colchester CO4 3SQ, Essex,
More informationDesigning Views to Answer Queries under Set, Bag,and BagSet Semantics
Designing Views to Answer Queries under Set, Bag,and BagSet Semantics Rada Chirkova Department of Computer Science, North Carolina State University Raleigh, NC 27695-7535 chirkova@csc.ncsu.edu Foto Afrati
More informationScalability via Parallelization of OWL Reasoning
Scalability via Parallelization of OWL Reasoning Thorsten Liebig, Andreas Steigmiller, and Olaf Noppens Institute for Artificial Intelligence, Ulm University 89069 Ulm, Germany firstname.lastname@uni-ulm.de
More informationAn Overview of various methodologies used in Data set Preparation for Data mining Analysis
An Overview of various methodologies used in Data set Preparation for Data mining Analysis Arun P Kuttappan 1, P Saranya 2 1 M. E Student, Dept. of Computer Science and Engineering, Gnanamani College of
More informationChapter 13: Query Processing
Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing
More informationFinal Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23
Final Exam Review 2 Kathleen Durant CS 3200 Northeastern University Lecture 23 QUERY EVALUATION PLAN Representation of a SQL Command SELECT {DISTINCT} FROM {WHERE
More informationQUERY OPTIMIZATION E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 QUERY OPTIMIZATION
E0 261 Jayant Haritsa Computer Science and Automation Indian Institute of Science JAN 2014 Slide 1 Database Engines Main Components Query Processing Transaction Processing Access Methods JAN 2014 Slide
More informationHorizontal Aggregations for Mining Relational Databases
Horizontal Aggregations for Mining Relational Databases Dontu.Jagannadh, T.Gayathri, M.V.S.S Nagendranadh. Department of CSE Sasi Institute of Technology And Engineering,Tadepalligudem, Andhrapradesh,
More informationAn Oracle White Paper April 2010
An Oracle White Paper April 2010 In October 2009, NEC Corporation ( NEC ) established development guidelines and a roadmap for IT platform products to realize a next-generation IT infrastructures suited
More informationA FRAMEWORK FOR EFFICIENT DATA SEARCH THROUGH XML TREE PATTERNS
A FRAMEWORK FOR EFFICIENT DATA SEARCH THROUGH XML TREE PATTERNS SRIVANI SARIKONDA 1 PG Scholar Department of CSE P.SANDEEP REDDY 2 Associate professor Department of CSE DR.M.V.SIVA PRASAD 3 Principal Abstract:
More informationColumn Stores vs. Row Stores How Different Are They Really?
Column Stores vs. Row Stores How Different Are They Really? Daniel J. Abadi (Yale) Samuel R. Madden (MIT) Nabil Hachem (AvantGarde) Presented By : Kanika Nagpal OUTLINE Introduction Motivation Background
More informationData Access Paths for Frequent Itemsets Discovery
Data Access Paths for Frequent Itemsets Discovery Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science {marekw, mzakrz}@cs.put.poznan.pl Abstract. A number
More information! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationChapter 13: Query Processing Basic Steps in Query Processing
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationScalable Access to SAS Data Billy Clifford, SAS Institute Inc., Austin, TX
Scalable Access to SAS Data Billy Clifford, SAS Institute Inc., Austin, TX ABSTRACT Symmetric multiprocessor (SMP) computers can increase performance by reducing the time required to analyze large volumes
More informationData integration supports seamless access to autonomous, heterogeneous information
Using Constraints to Describe Source Contents in Data Integration Systems Chen Li, University of California, Irvine Data integration supports seamless access to autonomous, heterogeneous information sources
More informationMore on Conjunctive Selection Condition and Branch Prediction
More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused
More informationMapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1
MapReduce-II September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate the different kind of processes in the MapReduce framework 2. Explain the information kept in the master 3.
More informationECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data
More informationSimilarity Joins of Text with Incomplete Information Formats
Similarity Joins of Text with Incomplete Information Formats Shaoxu Song and Lei Chen Department of Computer Science Hong Kong University of Science and Technology {sshaoxu,leichen}@cs.ust.hk Abstract.
More informationISSUES IN SPATIAL DATABASES AND GEOGRAPHICAL INFORMATION SYSTEMS (GIS) HANAN SAMET
zk0 ISSUES IN SPATIAL DATABASES AND GEOGRAPHICAL INFORMATION SYSTEMS (GIS) HANAN SAMET COMPUTER SCIENCE DEPARTMENT AND CENTER FOR AUTOMATION RESEARCH AND INSTITUTE FOR ADVANCED COMPUTER STUDIES UNIVERSITY
More informationEvaluation of Parallel Programs by Measurement of Its Granularity
Evaluation of Parallel Programs by Measurement of Its Granularity Jan Kwiatkowski Computer Science Department, Wroclaw University of Technology 50-370 Wroclaw, Wybrzeze Wyspianskiego 27, Poland kwiatkowski@ci-1.ci.pwr.wroc.pl
More informationTPC-DI. The First Industry Benchmark for Data Integration
The First Industry Benchmark for Data Integration Meikel Poess, Tilmann Rabl, Hans-Arno Jacobsen, Brian Caufield VLDB 2014, Hangzhou, China, September 4 Data Integration Data Integration (DI) covers a
More informationData Warehousing and Decision Support
Data Warehousing and Decision Support Chapter 23, Part A Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke 1 Introduction Increasingly, organizations are analyzing current and historical
More informationMining Distributed Frequent Itemset with Hadoop
Mining Distributed Frequent Itemset with Hadoop Ms. Poonam Modgi, PG student, Parul Institute of Technology, GTU. Prof. Dinesh Vaghela, Parul Institute of Technology, GTU. Abstract: In the current scenario
More informationPivoting M-tree: A Metric Access Method for Efficient Similarity Search
Pivoting M-tree: A Metric Access Method for Efficient Similarity Search Tomáš Skopal Department of Computer Science, VŠB Technical University of Ostrava, tř. 17. listopadu 15, Ostrava, Czech Republic tomas.skopal@vsb.cz
More informationOn Multiple Query Optimization in Data Mining
On Multiple Query Optimization in Data Mining Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland {marek,mzakrz}@cs.put.poznan.pl
More information2.3 Algorithms Using Map-Reduce
28 CHAPTER 2. MAP-REDUCE AND THE NEW SOFTWARE STACK one becomes available. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure
More informationUpdates through Views
1 of 6 15 giu 2010 00:16 Encyclopedia of Database Systems Springer Science+Business Media, LLC 2009 10.1007/978-0-387-39940-9_847 LING LIU and M. TAMER ÖZSU Updates through Views Yannis Velegrakis 1 (1)
More informationNew Join Operator Definitions for Sensor Network Databases *
Proceedings of the 6th WSEAS International Conference on Applications of Electrical Engineering, Istanbul, Turkey, May 27-29, 2007 41 New Join Operator Definitions for Sensor Network Databases * Seungjae
More informationCAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1
CAS CS 460/660 Introduction to Database Systems Query Evaluation II 1.1 Cost-based Query Sub-System Queries Select * From Blah B Where B.blah = blah Query Parser Query Optimizer Plan Generator Plan Cost
More informationIncreasing Database Performance through Optimizing Structure Query Language Join Statement
Journal of Computer Science 6 (5): 585-590, 2010 ISSN 1549-3636 2010 Science Publications Increasing Database Performance through Optimizing Structure Query Language Join Statement 1 Ossama K. Muslih and
More informationImpala Intro. MingLi xunzhang
Impala Intro MingLi xunzhang Overview MPP SQL Query Engine for Hadoop Environment Designed for great performance BI Connected(ODBC/JDBC, Kerberos, LDAP, ANSI SQL) Hadoop Components HDFS, HBase, Metastore,
More informationB.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2
Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,
More informationQuery optimization. Elena Baralis, Silvia Chiusano Politecnico di Torino. DBMS Architecture D B M G. Database Management Systems. Pag.
Database Management Systems DBMS Architecture SQL INSTRUCTION OPTIMIZER MANAGEMENT OF ACCESS METHODS CONCURRENCY CONTROL BUFFER MANAGER RELIABILITY MANAGEMENT Index Files Data Files System Catalog DATABASE
More informationWeb-based Energy-efficient Cache Invalidation in Wireless Mobile Environment
Web-based Energy-efficient Cache Invalidation in Wireless Mobile Environment Y.-K. Chang, M.-H. Hong, and Y.-W. Ting Dept. of Computer Science & Information Engineering, National Cheng Kung University
More informationDatabase and Knowledge-Base Systems: Data Mining. Martin Ester
Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro
More informationAn Information-Theoretic Approach to the Prepruning of Classification Rules
An Information-Theoretic Approach to the Prepruning of Classification Rules Max Bramer University of Portsmouth, Portsmouth, UK Abstract: Keywords: The automatic induction of classification rules from
More informationECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs
More informationPerformance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads
Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java s Devrim Akgün Computer Engineering of Technology Faculty, Duzce University, Duzce,Turkey ABSTRACT Developing multi
More informationTraining. Data Modelling. Framework Manager Projects (2 days) Contents
We aim to provide you with the right training, at the right time and at the right price'. A cost effective solution to your business objectives. Our trainers are experts in IBM Cognos applications and
More informationA Fast and High Throughput SQL Query System for Big Data
A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190
More informationAnalyzing Dshield Logs Using Fully Automatic Cross-Associations
Analyzing Dshield Logs Using Fully Automatic Cross-Associations Anh Le 1 1 Donald Bren School of Information and Computer Sciences University of California, Irvine Irvine, CA, 92697, USA anh.le@uci.edu
More informationPerformance Optimization for Informatica Data Services ( Hotfix 3)
Performance Optimization for Informatica Data Services (9.5.0-9.6.1 Hotfix 3) 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,
More informationHash-Based Indexing 165
Hash-Based Indexing 165 h 1 h 0 h 1 h 0 Next = 0 000 00 64 32 8 16 000 00 64 32 8 16 A 001 01 9 25 41 73 001 01 9 25 41 73 B 010 10 10 18 34 66 010 10 10 18 34 66 C Next = 3 011 11 11 19 D 011 11 11 19
More informationThe Design and Optimization of Database
Journal of Physics: Conference Series PAPER OPEN ACCESS The Design and Optimization of Database To cite this article: Guo Feng 2018 J. Phys.: Conf. Ser. 1087 032006 View the article online for updates
More informationUniversity of Waterloo Midterm Examination Sample Solution
1. (4 total marks) University of Waterloo Midterm Examination Sample Solution Winter, 2012 Suppose that a relational database contains the following large relation: Track(ReleaseID, TrackNum, Title, Length,
More information1. Attempt any two of the following: 10 a. State and justify the characteristics of a Data Warehouse with suitable examples.
Instructions to the Examiners: 1. May the Examiners not look for exact words from the text book in the Answers. 2. May any valid example be accepted - example may or may not be from the text book 1. Attempt
More informationJob Re-Packing for Enhancing the Performance of Gang Scheduling
Job Re-Packing for Enhancing the Performance of Gang Scheduling B. B. Zhou 1, R. P. Brent 2, C. W. Johnson 3, and D. Walsh 3 1 Computer Sciences Laboratory, Australian National University, Canberra, ACT
More informationMitigating Data Skew Using Map Reduce Application
Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,
More informationBuilt for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations
Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations Table of contents Faster Visualizations from Data Warehouses 3 The Plan 4 The Criteria 4 Learning
More informationCombining Distributed Memory and Shared Memory Parallelization for Data Mining Algorithms
Combining Distributed Memory and Shared Memory Parallelization for Data Mining Algorithms Ruoming Jin Department of Computer and Information Sciences Ohio State University, Columbus OH 4321 jinr@cis.ohio-state.edu
More informationData Warehousing and Decision Support
Data Warehousing and Decision Support [R&G] Chapter 23, Part A CS 4320 1 Introduction Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business
More informationJoin (SQL) - Wikipedia, the free encyclopedia
페이지 1 / 7 Sample tables All subsequent explanations on join types in this article make use of the following two tables. The rows in these tables serve to illustrate the effect of different types of joins
More informationA Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 A Real Time GIS Approximation Approach for Multiphase
More informationSupporting Fuzzy Keyword Search in Databases
I J C T A, 9(24), 2016, pp. 385-391 International Science Press Supporting Fuzzy Keyword Search in Databases Jayavarthini C.* and Priya S. ABSTRACT An efficient keyword search system computes answers as
More informationHYRISE In-Memory Storage Engine
HYRISE In-Memory Storage Engine Martin Grund 1, Jens Krueger 1, Philippe Cudre-Mauroux 3, Samuel Madden 2 Alexander Zeier 1, Hasso Plattner 1 1 Hasso-Plattner-Institute, Germany 2 MIT CSAIL, USA 3 University
More informationOptimizing System Performance
243 CHAPTER 19 Optimizing System Performance Definitions 243 Collecting and Interpreting Performance Statistics 244 Using the FULLSTIMER and STIMER System Options 244 Interpreting FULLSTIMER and STIMER
More informationOptimising Mediator Queries to Distributed Engineering Systems
Optimising Mediator Queries to Distributed Engineering Systems Mattias Nyström 1 and Tore Risch 2 1 Luleå University of Technology, S-971 87 Luleå, Sweden Mattias.Nystrom@cad.luth.se 2 Uppsala University,
More informationAnalysis of Basic Data Reordering Techniques
Analysis of Basic Data Reordering Techniques Tan Apaydin 1, Ali Şaman Tosun 2, and Hakan Ferhatosmanoglu 1 1 The Ohio State University, Computer Science and Engineering apaydin,hakan@cse.ohio-state.edu
More informationA New Online Clustering Approach for Data in Arbitrary Shaped Clusters
A New Online Clustering Approach for Data in Arbitrary Shaped Clusters Richard Hyde, Plamen Angelov Data Science Group, School of Computing and Communications Lancaster University Lancaster, LA1 4WA, UK
More informationData Warehousing and Decision Support. Introduction. Three Complementary Trends. [R&G] Chapter 23, Part A
Data Warehousing and Decision Support [R&G] Chapter 23, Part A CS 432 1 Introduction Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business
More informationA 12-STEP SORTING NETWORK FOR 22 ELEMENTS
A 12-STEP SORTING NETWORK FOR 22 ELEMENTS SHERENAZ W. AL-HAJ BADDAR Department of Computer Science, Kent State University Kent, Ohio 44240, USA KENNETH E. BATCHER Department of Computer Science, Kent State
More informationDistributed File Systems. CS 537 Lecture 15. Distributed File Systems. Transfer Model. Naming transparency 3/27/09
Distributed File Systems CS 537 Lecture 15 Distributed File Systems Michael Swift Goal: view a distributed system as a file system Storage is distributed Web tries to make world a collection of hyperlinked
More information