
Data Mining in Data Warehouses

Elena Baralis, Rosa Meo
Politecnico di Torino, Dipartimento di Automatica e Informatica
Corso Duca degli Abruzzi, 24 - Torino, Italy

Giuseppe Psaila
Dipartimento di Elettronica e Informazione, Politecnico di Milano
P.za Leonardo da Vinci - Milano, Italy

baralis/rosimeo@polito.it, psaila@elet.polimi.it

Abstract

Data warehouses provide an integrated environment where huge amounts of data extracted from operational sources are available for various kinds of decision support analysis. Hence, to let users improve the quality of their analyses, it is becoming fundamentally important to effectively integrate mining capabilities and data warehousing technology. This paper describes AMORE-DW, an integrated environment for the specification of data mining requests and the extraction of association rules from a data warehouse. The adopted architecture is characterized by a tight coupling of data mining with the relational OLAP (ROLAP) server on the data warehouse, which provides efficient access to the data to be analyzed. The main issues faced during the design are presented, and the trade-off between flexible data analysis and system performance is discussed.

1 Introduction

The availability of efficient and reliable database technology allows the massive and systematic gathering of huge amounts of operational data concerning every kind of human activity, such as business transactions or scientific research. Before being analyzed, the raw data, possibly extracted from heterogeneous sources, need to be properly integrated and carefully cleaned, so that reliable information can be extracted. Furthermore, data analysis algorithms perform complex (and expensive) operations on data, which are best performed in a separate environment to avoid hampering daily operational data processing.
Data warehousing technology [6], which has experienced explosive growth in the past few years, provides an integrated environment where data extracted from operational sources are available for different types of decision support analysis. Hence, data warehouses are expected to naturally become a major platform for data mining [7].

To extract useful information that may be exploited, e.g., in business decision making, the data stored in the data warehouse must be analyzed with appropriate techniques. While OLAP (On-Line Analytical Processing) analysis is devoted to the computation of complex aggregations on the data, data mining focuses on the extraction of the most frequent regularities that characterize the data. Such regularities are described by means of specific models, which give a more abstract description of the data.

An important class of data mining problems is represented by association rules. Association rules describe the most common links among data items in a large amount of collected data. The "classical" example application of association rule discovery [1] is the analysis of data recording customer purchases at supermarket checkouts. In this context, association rules describe which products are likely to be bought together by a statistically relevant number of customers. The discovered information can then be used by the store management to support strategic decisions, e.g., for planning and marketing. In general, an association rule is characterized by the structure:

$X \Rightarrow Y$

where X, the rule body, and Y, the rule head, are two sets of values drawn from the mined attribute, i.e., the attribute whose behavior is observed (e.g., bought items in the above example). To perform rule extraction, data are grouped by some attribute (e.g., customer transactions); rules describe regularities of the mined attribute with respect to the groups. The relevance of a rule is expressed in terms of the frequency with which the rule is observed in the analyzed data. Thus, two probability measures, called support and confidence, are associated with a rule. The support is the joint probability of finding X and Y in the same group. The confidence is the conditional probability of finding Y in a group, given that X has been found in it. Two minimum thresholds for support and confidence are defined in order to discard the less frequent associations in the database.

[Footnote: This work has been supported by the Interdata MURST grant.]

The integration of data mining techniques with data warehousing technology will enhance both the data analysis capabilities provided by current data warehouse products and the expressive power and flexibility of the analysis performed by current data mining tools. In fact, current commercial ROLAP (Relational OLAP) servers provide both the powerful data retrieval services of relational DBMS servers and ad hoc OLAP optimization techniques. This in turn allows data analysts to specify complex search criteria (more refined than current data mining tools would allow) in order to extract more useful knowledge from the raw data stored in the warehouse.

These considerations inspired the AMORE-DW (Advanced Mining On Relational Environments - Data Warehousing) project. Its main goal is the development of a mining tool tightly integrated with the data warehouse and its ROLAP server, such that the source data are the data collected in the data warehouse and the extracted rules are represented as database relations.
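The support and confidence measures defined in the introduction can be illustrated with a minimal Python sketch. The toy groups, the product names and the function names below are hypothetical and serve only as an illustration; they are not part of AMORE-DW.

```python
from typing import Dict, Set

# Hypothetical toy data: each group is one customer, and the mined
# attribute is the set of products bought by that customer.
GROUPS: Dict[str, Set[str]] = {
    "c-1": {"bread", "milk"},
    "c-2": {"bread", "milk", "butter"},
    "c-3": {"bread"},
    "c-4": {"milk", "butter"},
}


def support(groups: Dict[str, Set[str]], body: Set[str], head: Set[str]) -> float:
    """Fraction of all groups that contain both the body and the head."""
    both = body | head
    hits = sum(1 for items in groups.values() if both <= items)
    return hits / len(groups)


def confidence(groups: Dict[str, Set[str]], body: Set[str], head: Set[str]) -> float:
    """Among the groups containing the body, the fraction that also contain the head."""
    with_body = [items for items in groups.values() if body <= items]
    if not with_body:
        return 0.0
    return sum(1 for items in with_body if head <= items) / len(with_body)


if __name__ == "__main__":
    body, head = {"bread"}, {"milk"}
    print("support:", support(GROUPS, body, head))
    print("confidence:", confidence(GROUPS, body, head))
```

On this toy data the rule {bread} ⇒ {milk} holds in two of the four groups (support 0.5), and in two of the three groups that contain bread (confidence 2/3).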
In this context, mining requests are described by means of an SQL-like language, which allows a flexible specification of mining statements and extends the semantics of other languages [8]; the language also provides specific constructs to deal with the data schema typical of data warehouses (known as star schema), in order to simplify the specification of the mining request and to allow the data mining tool to perform appropriate optimizations that reduce the cost of the analysis.

In this paper we illustrate the main features of the AMORE-DW environment and we discuss the design issues and the problems encountered in integrating data mining and data warehousing technology. In particular, in Section 2 we describe a declarative language for the specification of mining requests, based on the SQL-like operator MINE RULE, while in Section 3 we present the architecture of the AMORE-DW prototype. Section 4 discusses the trade-off between flexible data analysis and system performance when defining which tasks of the rule extraction process are to be performed by the OLAP server and which by special-purpose extraction algorithms. Section 5 draws conclusions.

2 Specification Language for Association Rules Extraction

Several algorithms have been proposed to extract specific types of association rules (e.g., "classical" association rules [1], generalized association rules [12], ...); they operate on fixed-format data and are tailored to the specific type of association rules to be searched. We propose a general-purpose, SQL-like language to declaratively specify the features of the association rules to be extracted from the data warehouse. Since this language is not bound to any specific data schema, it allows the user to freely search through the whole schema of the data warehouse. Furthermore, it allows the specification of complex extraction criteria, which are not available in traditional association rule extraction prototypes.
Thus, the user is able to restrict the search space by progressively refining the characteristics of the association rules to be extracted. An initial description of MINE RULE, the main operator of the language, can be found in [9]; it is extended here to cope with the specific features of data warehouses. The operator is introduced by means of an example, which specifies several complex extraction criteria. The example is based on the data warehouse schema, describing sales in a supermarket, which is represented in Figure 1.

Customers (cust-id, name, address, birth-year, city, city-id, region, region-id, ...)
Sales (cust-id, product-id, day-id, price, qty)
Time (day-id, day, month, year, ...)
Products (product-id, product, subcategory-id, subcategory, category-id, category, ...)

Figure 1: Star schema of the supermarket data warehouse; bold attribute names represent the primary key of each table.

Figure 2: The Sales fact table: (a) simplified instance, (b) grouped by cust-id and clustered by day-id. Columns: cust-id, product-id, day-id, price, qty. [Only part of the cell values is preserved; the rows of (a) pair customers and products as (c-1, p-1), (c-1, p-2), (c-2, p-3), (c-2, p-4), (c-2, p-5), (c-1, p-4), (c-2, p-3), (c-2, p-5).]

Since data warehouse schemas usually have a star topology, they are called star schemas. In this example, the star center, called the fact table, describes sales figures for a supermarket. In particular, each purchase is performed by a specific customer (supposing the store uses customer cards to identify its customers) to buy a specific product (identified, e.g., by means of the bar code) on a specific date. These three pieces of information are the main dimensions along which each sale event is characterized. Further information on each dimension can be obtained from the appropriate dimension table, which describes in more detail the features of customers (e.g., demographic information, complete address, ...), products (e.g., merchandise hierarchy, given by product sub-category and category, and other attributes, such as package type, and many more), and time (e.g., time hierarchy, given by day, week or month, year, and other attributes useful for tracking sales, such as holiday and special event indications, ...). Each sale fact in the Sales table is further characterized by two measures: the price and the quantity of the sold product. Finally, we observe that non-numerical attributes (e.g., category in table Products) are also included in the dimension tables in encoded format (see footnote 1).
This format is usually used in data processing instead of the original one for efficiency reasons.

For the sake of simplicity, to illustrate the expressive power of the MINE RULE operator, we consider the reduced instance of the Sales table presented in Figure 2(a). Suppose we want to extract association rules with the following features:

a) Rules describe the behavior of customers in terms of the sets of products most frequently purchased by them.
b) Only customers born after 1970 who purchased at least two products are considered.

[Footnote 1: The attributes are encoded during the loading process and the periodical refresh of the data warehouse. Note that dimensions are seldom updated during the periodical refresh of the data warehouse.]

c) Products appearing in the body must be purchased on the same date, after October. Products in the head are purchased on the same date, but after the products in the corresponding body.
d) Products in the body have a price less than or equal to $200, whereas products in the head cost less than products in the body.
e) Rules are interesting only if their support is at least 20% and their confidence is at least 30% (see footnote 2).

The following statement allows the extraction of association rules corresponding to the above specification.

    MINE RULE YoungCustomers AS
    SELECT DISTINCT 1..n product-id->product AS BODY,
                    1..n product-id->product AS HEAD,
                    SUPPORT, CONFIDENCE
    WHERE BODY.price <= 200 AND HEAD.price < BODY.price
    FROM Sales
    GROUP BY cust-id
    HAVING COUNT(*) >= 2 AND cust-id->birth-year > 1970
    CLUSTER BY day-id
    HAVING BODY.day-id->day < HEAD.day-id->day AND BODY.day-id->month > 10
    EXTRACTING RULES WITH SUPPORT: 0.2, CONFIDENCE: 0.3

The association rules are extracted by performing the following steps:

Data Source. The FROM clause specifies the source data to analyze. Only the fact table needs to be specified. To reference attributes in the dimension tables, the referencing "->" operator is used to follow the foreign key constraint from the fact table to the dimension table. For example, the attribute birth-year of the Customers dimension is reached through the attribute cust-id. Observe that the analysis always involves the data contained in the fact table; the dimensions may be referenced as well, but never in the absence of the fact table.

Group computation. The GROUP BY clause specifies that the source relation Sales is logically partitioned into groups of tuples having the same value for the grouping attribute cust-id (corresponding to feature (a) above).

Group filtering. The (optional) HAVING clause associated with the GROUP BY clause discards, before rule extraction, all groups with fewer than two tuples or whose customer was born in 1970 or earlier (corresponding to feature (b) above).
Cluster identification. The (optional) CLUSTER BY clause further partitions each group into sub-groups called clusters, such that tuples in a cluster have the same value for the clustering attribute day-id. The result of both grouping and clustering the data instance in Figure 2(a) is represented in Figure 2(b). When clustering is specified, the body of a rule is extracted from (smaller) clusters instead of entire groups, and analogously for rule heads. Thus elements in the body (and in the head) share the same value of the clustering attribute (corresponding to feature (c) above).

Cluster coupling. To compose rules, every pair of clusters (one for the body and one for the head) inside the same group is considered. Furthermore, the optional HAVING clause of the CLUSTER BY clause selects the cluster pairs that should be considered for extracting rules. In this case, a pair of clusters is considered only if the day of the left-hand cluster, the body cluster, precedes the day of the right-hand cluster, the head cluster (corresponding to feature (c) above).

[Footnote 2: These figures are meaningful only for the example data warehouse; in real applications they would be significantly lower.]

Rule extraction. From each group, the SELECT clause extracts all possible associations of an unlimited set of products (clause 1..n product-id->product AS BODY), representing the body of rules, with another set of products (clause 1..n product-id->product AS HEAD), representing the head of rules (corresponding to feature (a) above).

Mining condition. The (optional) WHERE clause following the SELECT clause forces rule extraction to consider only pairs of tuples, the first one (called body tuple) coming from the body cluster and the second one (called head tuple) coming from the head cluster, such that the value of attribute price in the body tuple is less than or equal to 200 and is higher than the value of attribute price in the head tuple (corresponding to feature (d) above).

Support and confidence evaluation. The support of a rule is the number of groups from which the rule is extracted divided by the total number of groups generated by the GROUP BY clause. The confidence is the number of groups from which the rule is extracted divided by the number of groups that contain the body in some cluster. When support or confidence is lower than the respective minimum threshold (20/100 = 0.2 for support and 30/100 = 0.3 for confidence in our sample statement, see feature (e)), the rule is discarded.

This is the rule set extracted from the Sales table instance presented in Figure 2(a):

    Rule                                      Support   Confidence
    {brown boots} ⇒ {col shirts}              [...]     [...]
    {jackets} ⇒ {col shirts}                  [...]     [...]
    {brown boots, jackets} ⇒ {col shirts}     [...]     [...]

where brown boots corresponds to product identifier p-3, jackets to p-4, and col shirts to p-5. Since both rule bodies and heads are sets, they can be represented as relational set attributes, similarly to the forthcoming SQL3 standard.

The statement described above specifies very complex extraction criteria and exploits most of the powerful features provided by the MINE RULE operator.
Simpler extraction criteria are normally specified in the initial phase of the search process; mining requests are then progressively refined by adding more complex filtering conditions. Rules can be divided into several (orthogonal) classes, depending on the extraction criteria that guide the extraction process:

Simple Association Rules. Only the basic (mandatory) extraction criteria are specified (source data, grouping attribute, mined attribute, body and head cardinality, minimum support and confidence). This class corresponds to classical association rule extraction [1].

Filtered Association Rules. A group filtering condition is specified: rules are extracted only from a selected subset of groups.

Mining-Constrained Association Rules. A mining condition expresses a complex correlation between rule body and rule head.

Clustering Association Rules. The clustering attributes, and possibly the cluster selection predicate, are specified. Bodies and heads of the rules are extracted from clusters instead of entire groups.

These classes are not mutually exclusive. Hence, mining problems can be expressed as a combination of the described extraction criteria.
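The extraction steps described for the YoungCustomers statement (data source with "->" references, grouping, group filtering, clustering, cluster coupling, mining condition, support and confidence) can be sketched in Python as follows. This is only an illustration restricted to rules with one product in the body and one in the head. The data, the dictionary-based dimension lookups and the function name are hypothetical, and two readings of the text are assumed: the support denominator counts all groups before group filtering, and the confidence denominator counts the selected groups that contain the body product in any cluster.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Sale(NamedTuple):
    cust_id: str
    product_id: str
    day_id: str
    price: float


# Hypothetical dimension tables; a dictionary lookup plays the role of "->".
CUSTOMERS = {"c-1": {"birth_year": 1965}, "c-2": {"birth_year": 1975},
             "c-3": {"birth_year": 1980}, "c-4": {"birth_year": 1985}}
TIME = {"d-1": {"day": 5, "month": 11}, "d-2": {"day": 12, "month": 11},
        "d-3": {"day": 20, "month": 12}}

# Hypothetical fact table (not the instance of Figure 2).
SALES = [
    Sale("c-1", "p-1", "d-1", 140), Sale("c-1", "p-2", "d-1", 180),
    Sale("c-1", "p-4", "d-2", 300),
    Sale("c-2", "p-3", "d-1", 150), Sale("c-2", "p-4", "d-1", 190),
    Sale("c-2", "p-5", "d-2", 25),
    Sale("c-3", "p-3", "d-1", 150), Sale("c-3", "p-5", "d-3", 25),
    Sale("c-4", "p-4", "d-2", 190),
]


def elementary_rules(sales: List[Sale], customers, time,
                     min_support: float, min_confidence: float
                     ) -> Dict[Tuple[str, str], Tuple[float, float]]:
    # GROUP BY cust-id
    groups = defaultdict(list)
    for sale in sales:
        groups[sale.cust_id].append(sale)
    total_groups = len(groups)  # assumption: counted before group filtering

    # HAVING COUNT(*) >= 2 AND cust-id->birth-year > 1970
    valid = {cust: rows for cust, rows in groups.items()
             if len(rows) >= 2 and customers[cust]["birth_year"] > 1970}

    rule_groups = defaultdict(set)   # (body, head) -> groups yielding the rule
    body_groups = defaultdict(set)   # body -> groups containing the body
    for cust, rows in valid.items():
        # CLUSTER BY day-id
        clusters = defaultdict(list)
        for sale in rows:
            clusters[sale.day_id].append(sale)
            body_groups[sale.product_id].add(cust)
        # Cluster coupling: every (body cluster, head cluster) pair in the group
        for body_day, body_rows in clusters.items():
            for head_day, head_rows in clusters.items():
                # HAVING BODY.day-id->day < HEAD.day-id->day
                #    AND BODY.day-id->month > 10
                if not (time[body_day]["day"] < time[head_day]["day"]
                        and time[body_day]["month"] > 10):
                    continue
                for b in body_rows:
                    for h in head_rows:
                        # WHERE BODY.price <= 200 AND HEAD.price < BODY.price
                        if b.price <= 200 and h.price < b.price:
                            rule_groups[(b.product_id, h.product_id)].add(cust)

    result = {}
    for (body, head), custs in rule_groups.items():
        supp = len(custs) / total_groups
        conf = len(custs) / len(body_groups[body])
        if supp >= min_support and conf >= min_confidence:
            result[(body, head)] = (supp, conf)
    return result


RULES = elementary_rules(SALES, CUSTOMERS, TIME, 0.2, 0.3)
```

On this toy data, customers c-1 (born before 1970) and c-4 (a single purchase) are discarded by the group filter, and the remaining groups yield p-3 ⇒ p-5 and p-4 ⇒ p-5. Rules with larger bodies or heads would be built on top of these one-to-one associations.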

3 AMORE-DW Architecture

The AMORE-DW prototype is based on a client-server architecture. On the client side, by means of a suite of user-friendly interface tools, the user specifies mining requests, which are then automatically mapped to statements in the language for the extraction of association rules described in Section 2. Then, the mining statement is submitted to the mining server, which performs the rule extraction. Finally, on the client side, the user may browse the result of the extraction process.

The mining server, which is in charge of the actual rule extraction, is tightly coupled with the ROLAP server, in order to exploit its powerful data manipulation language and its data storage and access facilities. Typically, both servers (the mining server and the ROLAP server) should reside on the same system, which is usually devoted to running computationally expensive analysis jobs. In the following, we focus on the architecture of the mining server, presenting in more detail all its components and their interaction.

3.1 Mining Server Architecture

[Figure 3: Mining Server architecture. Components: Translator, Mining Kernel, ROLAP SQL Server; input: Mining Statement; output: Extracted Rules.]

The architecture of the Mining Server is depicted in Figure 3. It encompasses the following components:

- The Translator interprets the MINE RULE mining statement: it checks the correctness of the request and generates the processing directives for the Mining Kernel.
- The Mining Kernel is the specialized component for data analysis and association rule extraction.
- The ROLAP Server prepares the data for the analysis according to the Mining Kernel's directives, provides efficient access to them and stores the analysis results.

When a mining request is received by the Mining Server, the following processing steps take place (see Figure 3, where solid arrows indicate the information flow between the components of the system):

1. The Translator performs the syntactic and semantic verification of the statement's correctness. It checks the definition of the table and attribute names referenced in the statement against the ROLAP server Data Dictionary. This information flow is represented in Figure 3 by the edge connecting the ROLAP Server to the Translator.

2. The Translator extracts the features characterizing the submitted statement and generates processing directives for the Mining Kernel. In the figure, the processing directives sent to the Mining Kernel are represented by the edge from the Translator to the Mining Kernel.

3. The Mining Kernel is activated upon receiving the processing directives from the Translator. It extracts association rules by performing the following tasks:
   (a) It instructs the ROLAP Server to preprocess the data (see Section 3.3.1).
   (b) It reads the data and composes association rules (see Section 3.3.2).
   (c) It stores the extracted rules into the data warehouse.
   In the figure, the bidirectional edge between the Mining Kernel and the ROLAP Server denotes the exchange of information in both directions.

At this point, the extraction process is completed and the user can browse the obtained association rules on the ROLAP Server.

An important issue in the implementation of the Mining Server is the identification of the border between typical data processing tasks, to be executed by the ROLAP server, and mining processing, performed by specialized algorithms in the Mining Kernel. At one extreme, the ROLAP server could be used only to retrieve raw data, as in the traditional approach. In this case, the mining algorithm would be overloaded by tasks such as the evaluation of complex SQL predicates on raw data. At the opposite extreme, the entire rule extraction could be performed by means of SQL programs executed by the ROLAP server. This solution is not efficient, since excessively frequent context switching between the SQL context and the mining application context would occur [3]. To decide which tasks are better performed by each server, we considered the following issues:

- The presence of the reference operator -> in MINE RULE statements requires some preprocessing to join the data referenced by means of the -> operator. This task is efficiently performed by the ROLAP Server.
- SQL predicates allowed by the MINE RULE operator are effectively evaluated by the ROLAP Server without overloading the mining algorithm.
- The actual association rule computation is an iterative process [1, 4, 11] that is better performed in main memory, with suitable data structures, by a specialized algorithm.

Hence, many of the extraction features that characterize the MINE RULE operator can be efficiently delegated to the ROLAP server. The pool of operations performed by the ROLAP server is embedded into an SQL package called the preprocessor (presented in Section 3.3.1). The actual rule discovery process is carried out by a specialized algorithm in the Mining Kernel, called the core operator (presented in Section 3.3.2).

3.2 Translator

The translator is in charge of the following tasks:

- It executes all lexical, syntactic and semantic checks on the mining statement. Semantic checks range from the simple verification of the correctness of the statement (e.g., the source specification should reference tables existing in the database dictionary) to the enforcement of constraints on the mining statement (e.g., the set of attributes in the GROUP BY clause should be disjoint from the set of attributes in the other clauses, such as the CLUSTER BY or the SELECT clause).
- It maps all non-numerical attributes (e.g., strings) referenced by the mining statement in non-conditional clauses into the corresponding encoded version (see footnote 3). Non-numerical attribute encoding allows a significant efficiency improvement for both the preprocessor and the core operator.

[Footnote 3: The correspondence is permanently stored in appropriate meta-data tables. Recall that the values of both the original attribute and its encoded version are stored in the data warehouse.]

- It analyzes the syntactic clauses in the MINE RULE statement, determining the class of mining statements to which the current statement belongs, in order to generate the processing directives for the Mining Kernel. MINE RULE statements are classified according to the following categories:
  - Basic statements: all the mandatory clauses are included (MINE RULE, SELECT, FROM, GROUP BY, EXTRACTING RULES WITH). Optionally, the selection of source data (WHERE predicate in the FROM clause) and the selection of relevant groups (HAVING predicate following the GROUP BY clause) may be specified. An arbitrary cardinality is allowed both in body and head.
  - Complex statements: optional features, such as the mining condition, clustering attribute(s) (CLUSTER BY clause) and cluster selection predicates (HAVING predicate following the CLUSTER BY clause), are specified.

Finally, the translator calls the Mining Kernel, passing to it all the information required for further processing.

3.3 Mining Kernel

The Mining Kernel performs the extraction of association rules. Its structure can be divided into two main blocks:

- The Preprocessor, which performs raw data preprocessing and yields data in a suitable format for the rule extraction algorithms in the core operator.
- The Core Operator, which performs the actual rule extraction.

3.3.1 Preprocessor

Depending on the class of mining statements specified by the user, the preprocessor executes distinct SQL programs, whose execution provides the core operator with the appropriate set of input data. These programs include database instructions that retrieve source data, evaluate selection predicates and association conditions, and prepare data in the appropriate format for the core operator. This component heavily exploits the services of the ROLAP Server and the specific data warehouse operating context.
In particular, it takes advantage of the available mechanisms to disable continuous logging of database operations, which is costly and useless in this operating context.

The structure of the preprocessor is depicted in Figure 4. In the figure, ovals indicate SQL programs, while rectangles denote views or temporary tables. Directed edges show the processing flow: arcs entering a program denote its input tables, arcs exiting a program denote its output tables. Labeled arcs denote that the corresponding operation is executed only if the associated feature appears in the mining statement (a legend is reported in Figure 4); the vertical bar (|) separating two labels denotes their disjunction. Depending on the class of mining statements, the Preprocessor executes the following operations:

Basic statements. The preprocessor retrieves source data by means of program Q0; its result, named Source, may be a view or a temporary table containing the result of joining the fact and the dimension tables. Then, program Q1 counts the total number of groups in Source (needed by the Core Operator to compute rule support). If a group filtering condition is specified, program Q2, which selects the set of groups that satisfy the condition, is also executed. Finally, program Q3 generates the input for the Core Operator, named CoreSource. While in the simplest case CoreSource is a view, it may be temporarily materialized when the Core Operator needs to scan it several times. CoreSource includes only the tuples that must be inspected in the extraction process.

Complex statements. In this case, since the Preprocessor is in charge of the evaluation of all complex SQL predicates on data, it generates the so-called elementary rules, i.e., basic associations of one item for the body and one item for the head. Hence, some further processing steps are required with respect to the former case. In particular, if either a mining condition is specified, or clustering attributes and possibly a cluster condition are specified, program Q5 reads view CoreSource and creates the elementary rules, storing them in the AllRules temporary table, together with the list of groups containing each rule. In the presence of aggregate functions in the cluster condition, the execution of program Q5 is preceded by the execution of program Q4, which produces table Clusters, where each cluster is associated with the corresponding value of the aggregate functions. This table is then used to compose the elementary rules in program Q5. Finally, the set of elementary rules having support higher than or equal to the minimum support threshold (the large rules) is computed and stored in temporary table InputRules, which is the input of the Core Operator. In particular, program Q6 counts the support of each rule and detects the large rules, while program Q7 prepares table InputRules.

Summarizing, the preprocessor produces two types of input data for the core operator:

The Source Data. The core operator performs its analysis by reading only the tuples in the CoreSource table, i.e., the actual tuples to be considered in the extraction process. It is unaware of the actual origin of the source data.

The Large Elementary Rules. These are the basic associations of one item for the body and one item for the head having support higher than or equal to the minimum support threshold, which are stored in temporary table InputRules.
They are produced by the evaluation of the association conditions in the mining statement (e.g., the cluster or the mining condition).

3.3.2 Core Operator

The data processing tasks performed by the preprocessor allow the core operator to be independent of the selection conditions specified in the MINE RULE statement, e.g., HAVING conditions on clusters. This independence is essential for the simplicity and efficiency of the implementation of the extraction algorithms.

The features provided by our operator extend the semantics of association rules with respect to other SQL-like operators proposed in the literature [8]. This extended semantics requires the core operator to be implemented with original solutions, which need several adaptations of the well-known algorithms proposed in the literature [1, 4, 11]. The Mining Kernel includes several algorithms, each one specifically tailored to one of the categories of MINE RULE statements outlined above in the description of the Translator.

Basic statements. In this case, the extraction process corresponds to the traditional rule extraction performed by the algorithms described in [1, 4, 11, 10]. Rule discovery is performed by initially building sets of items with sufficient support (called large itemsets), and then creating rules from the large itemsets. All these algorithms are based on the observation that the number of itemsets grows exponentially with the number of items. Since a counter in main memory is needed for each itemset whose support is computed, these itemsets are carefully selected and their number is kept as low as possible. The current version of the core operator for basic statements is based on the algorithm presented in [11]. However, the modularity of our architecture allows us to easily replace this algorithm with any of the algorithms proposed in the literature (see [1, 2, 4, 11], etc.). The algorithm operates in two passes.
In the first pass, the source data are divided into partitions of the same size, chosen so that the counters for the itemsets can be stored in main memory. The search for large itemsets is performed separately in each partition. This step is based on the observation that sufficient support in at least one partition is a necessary condition for an itemset to have sufficient support also with respect to the whole source data. The itemsets that have sufficient support within a partition are saved on disk. The second pass computes the effective support of the saved itemsets, by counting the number of groups in the entire database that contain each itemset. The actual support is given by the ratio between this number and the total number of groups (computed by the preprocessor).

[Figure 4: Preprocessor architecture. Labels on the arcs denote when they are enabled; the meaning of the labels is reported in the legend. Programs: Q0 (create Source), Q1 (count total groups), Q2 (create ValidGroups), Q3 (create CoreSource), Q4 (compute aggregate functions), Q5 (create elementary rules), Q6 (compute large rules), Q7 (select large rules). Tables: base tables, Source, ValidGroups, totg, CoreSource, Clusters, AllRules, LargeRules, InputRules. Legend: G = presence of the group filtering condition; M = presence of the mining condition; H = different attribute schema for body and head; C = clustered statement; F = aggregate functions in the cluster filtering condition.]

The actual rule discovery process considers each large itemset previously generated and extracts subsets of items. Denoting by $L$ a large itemset and by $H \subset L$ a subset, the rule $(L - H) \Rightarrow H$ is formed. Its support is defined as the itemset support, $\mathrm{supp}(L)$; its confidence is immediately computed as $\mathrm{supp}(L) / \mathrm{supp}(L - H)$.

Complex statements. In this case the core operator receives from the ROLAP server the elementary rules (see footnote 4) from which rules can be extracted, instead of computing them itself. The algorithm discovers rules starting from previously generated rules, as opposed to considering whole itemsets as in the previous algorithm. Rule discovery is performed in steps, progressively increasing the cardinality of body and head at each step, and computing rule support and confidence. At each step new rules are created by combining two rules found in the previous step. At first, two rules with matching heads are combined: the new rule has the same head, and its body is obtained as the union of the two bodies. In this way the cardinality of the rule body is increased. Afterwards, new rules are created by combining the heads, thus increasing head cardinality. Finally, the algorithm scans the CoreSource table to compute the support of bodies, in order to calculate the confidence of the extracted rules.
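The two-pass, partition-based search for large itemsets described for basic statements, followed by the derivation of rules $(L - H) \Rightarrow H$ with confidence $\mathrm{supp}(L) / \mathrm{supp}(L - H)$, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the core operator itself: the function names are hypothetical, the search inside each partition is a naive level-wise enumeration standing in for the algorithm of [11], and the locally large itemsets are kept in memory rather than saved on disk.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Itemset = FrozenSet[str]


def local_large_itemsets(partition: List[Itemset], min_support: float) -> Set[Itemset]:
    """Naive level-wise search for the itemsets that are large inside one partition."""
    threshold = min_support * len(partition)
    items = sorted({item for group in partition for item in group})
    candidates = [frozenset([item]) for item in items]
    large: Set[Itemset] = set()
    size = 1
    while candidates:
        current = {c for c in candidates
                   if sum(c <= group for group in partition) >= threshold}
        large |= current
        size += 1
        # Next-level candidates: unions of two large itemsets of the current level.
        candidates = list({a | b for a in current for b in current
                           if len(a | b) == size})
    return large


def partition_large_itemsets(groups: List[Itemset], min_support: float,
                             partition_size: int) -> Dict[Itemset, float]:
    # Pass 1: an itemset that is large overall must be large in at least
    # one partition, so the union of local results is a safe candidate set.
    candidates: Set[Itemset] = set()
    for start in range(0, len(groups), partition_size):
        candidates |= local_large_itemsets(groups[start:start + partition_size],
                                           min_support)
    # Pass 2: compute the actual support of each candidate on all groups.
    supports = {}
    for candidate in candidates:
        supp = sum(candidate <= group for group in groups) / len(groups)
        if supp >= min_support:
            supports[candidate] = supp
    return supports


def rules_from_itemsets(supports: Dict[Itemset, float], min_confidence: float
                        ) -> List[Tuple[Itemset, Itemset, float, float]]:
    """Form (L - H) => H for every large itemset L and proper non-empty subset H."""
    rules = []
    for itemset, supp in supports.items():
        for k in range(1, len(itemset)):
            for head_items in combinations(sorted(itemset), k):
                head = frozenset(head_items)
                body = itemset - head
                conf = supp / supports[body]  # supp(L) / supp(L - H)
                if conf >= min_confidence:
                    rules.append((body, head, supp, conf))
    return rules


if __name__ == "__main__":
    groups = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bd")]
    supports = partition_large_itemsets(groups, min_support=0.5, partition_size=2)
    for body, head, supp, conf in rules_from_itemsets(supports, min_confidence=0.7):
        print(sorted(body), "=>", sorted(head), supp, conf)
```

The lookup `supports[body]` relies on the fact that every subset of a large itemset is itself large, so it has already been found and counted in the second pass.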
4 Experimental Results

The experimental results in this section explore the trade-off between system performance and flexible data analysis that we faced in the development of our prototype. For the experiments, we generated two synthetic databases, db1 and db2, using the publicly available datagen generator [4]. Database db1 contains groups and each group has an average of 5 tuples (i.e., each customer's purchase averages 5 products); this yields a total database size of tuples. The chosen data distribution yields rules with an average length of 2 elements. Database db2 has again groups, but the average number of tuples in each group is 10. Hence, the size of db2 is twice the size of db1. Furthermore, in this case, the selected data distribution yields an average rule length of 4 elements. The experiments were performed on a (non-dedicated) Digital Alpha Server with 256 MB of RAM and the Oracle 8.0 ROLAP server.

The first experiment, whose result is presented in Figure 5(a), compares the performance of our architecture with the traditional flat file approach. We extracted simple association rules with minimum support 0.75% both from data stored in binary files and from data stored in the ROLAP server. The flat file approach, which is specifically designed for the discovery of association rules with a fixed structure and with very simple, unchangeable extraction criteria, yields clearly better performance (faster by a factor of roughly 8 for both data distributions). This result was obtained without specific optimizations of the database I/O (e.g., data is read tuple by tuple rather than using arrays), in order to show the worst-case difference between the two approaches. Optimized database access operations would bring the performance results closer. Unfortunately, the superior performance of the flat file approach is obtained at the price of completely losing the possibility of flexibly varying the rule extraction criteria.
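The difference between reading tuple by tuple and reading with arrays can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module purely as a stand-in for the Oracle server used in the experiments; the `Sales(cust_id, product_id)` table layout and the function name `load_groups` are assumptions made for this example, and no claim is made about the resulting speed-up.

```python
import sqlite3

def load_groups(conn, batch_size=1000):
    """Read (group id, item) rows in batches and build one item set per group."""
    cur = conn.execute(
        "SELECT cust_id, product_id FROM Sales ORDER BY cust_id"
    )
    groups = {}
    while True:
        # One call returns up to batch_size rows, instead of one row per call
        # as with cur.fetchone().
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        for cust_id, product_id in rows:
            groups.setdefault(cust_id, set()).add(product_id)
    return groups
```

Setting `batch_size=1` reproduces the tuple-by-tuple behaviour, so the same function can be used to compare the two access patterns on a given database.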
The second experiment shows the effect of adding a mining condition (namely, BODY.price < HEAD.price) to the previous extraction criteria. In this case, we compared both the performance and the selectivity of this complex statement with those of the simple statement considered in the previous experiment. For the experiment, whose result is presented in Figure 5(b), we considered database db1. The addition of a more selective extraction criterion (the mining condition reduces the allowed product pairs) significantly reduces the number of obtained rules (from 54 for the simple statement to 7 for the complex statement), thereby improving the significance of the extracted information. Furthermore, the preprocessing step performed by the ROLAP server significantly reduces the time spent by the algorithm in building the rules; hence, the obtained performance improves significantly (by a factor larger than 25). In particular, while in the former experiment the work performed by the preprocessor was very small and its effect on the overall performance was negligible, in this case most of the time is spent in the preprocessing step, which prepares the elementary large rules for the extraction algorithm. Hence, the preprocessing step, although it takes longer than before, can contribute significantly to the improvement of the overall performance of the architecture. Finally, note that the overall processing time for the complex statement is also lower (roughly by a factor of 3) than the time required by flat file processing of the simple statement.

Footnote 4: Elementary rules represent large itemsets of cardinality two, i.e., the rules with the lowest cardinality for body and head.

[Figure 5: Experimental results: (a) comparison of the flat file approach with the AMORE-DW prototype approach, on db1 and db2; (b) comparison of the execution of a simple statement with that of a complex statement (with mining condition) in the AMORE-DW architecture, on db1. Legend: preprocessing step on DW; algorithm step on DW; flat file approach.]

5 Conclusions

In this paper we described the architecture of the AMORE-DW system. AMORE-DW provides an environment for the specification of mining requests by means of a powerful SQL-like language for expressing rule extraction criteria.
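As an illustration of such an extraction criterion, the mining condition BODY.price < HEAD.price used in the second experiment can be sketched as a filter applied while counting elementary rules (one item in the body, one in the head). In AMORE-DW this work is done by SQL queries on the ROLAP server; the in-memory Python version below, including the name `elementary_rules` and the representation of a group as a list of (item, price) pairs, is an assumption made only for illustration.

```python
from collections import Counter
from itertools import permutations

def elementary_rules(groups, min_support, mining_condition=None):
    """Map (body_item, head_item) to its support, keeping only large rules.

    `groups` maps a group id to a list of (item, price) tuples.
    `mining_condition(body_price, head_price)` restricts the allowed pairs.
    """
    counts = Counter()
    for tuples in groups.values():
        pairs = set()  # each rule is counted at most once per group
        for (b_item, b_price), (h_item, h_price) in permutations(tuples, 2):
            if b_item == h_item:
                continue
            if mining_condition is None or mining_condition(b_price, h_price):
                pairs.add((b_item, h_item))
        counts.update(pairs)
    total = len(groups)
    return {r: n / total for r, n in counts.items() if n / total >= min_support}
```

Calling the function once with `mining_condition=None` and once with `lambda body_price, head_price: body_price < head_price` shows how the condition shrinks the set of elementary rules handed to the core operator.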
The MINE RULE operator has already been used for the specification of several mining problems applied to heterogeneous domains, e.g., the analysis of telephone data. The architecture of the AMORE-DW environment is characterized by a tight coupling with a ROLAP server, which manages access to the warehouse data from which rules are extracted. The flexibility gained by accessing data stored in relational format is counterbalanced by the reduced performance of the system with respect to the traditional flat file approach. Nevertheless, when more selective extraction conditions are specified, a significant improvement of the overall processing time is obtained, owing to the effective preprocessing of data performed by the ROLAP server. Finally, we observe that a tight coupling of mining and warehousing technology yields several opportunities that we are currently exploring:

- Selected relevant association rule sets can be precomputed and stored in the warehouse to guarantee optimal response time. These precomputed rule sets can be incrementally updated upon periodical update of the warehouse data.
- Extraction statements can be progressively refined, e.g., by providing more restrictive selection conditions; the system can store and exploit the result of previous processing steps to simplify the subsequent extraction of rules selected by the refined statements.
- A collection of different extraction algorithms (e.g., [4, 10]), each one appropriate for a different type of data distribution, can be easily incorporated in the system and used for different warehouse data distributions.

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD Conference on Management of Data, pages 207–216, Washington, D.C., May.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Knowledge Discovery in Databases, volume 2. AAAI/MIT Press, Santiago, Chile, September.
[3] R. Agrawal and K. Shim. Developing tightly-coupled data mining applications on a relational database system. In KDD-96, pages 287–290.
[4] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th VLDB Conference, Santiago, Chile, September.
[5] R. Agrawal and R. Srikant. Mining sequential patterns. In International Conference on Data Engineering, Taipei, Taiwan, March.
[6] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1):65–74, March.
[7] J. Han, S. H. Chee, and J. Y. Chiang. Issues for on-line analytical mining of data warehouses. In Proceedings of SIGMOD-98 Workshop on Research Issues on Data Mining and Knowledge Discovery, Seattle, Washington, USA, June.
[8] J. Han, Y. Fu, W. Wang, K. Koperski, and O. Zaiane. DMQL: A data mining query language for relational databases. In Proceedings of SIGMOD-96 Workshop on Research Issues on Data Mining and Knowledge Discovery.
[9] R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. In Proceedings of the 22nd VLDB Conference, Bombay, India, September.
[10] J. S. Park, M.-S. Chen, and P. S. Yu. An effective hash based algorithm for mining association rules. In Proceedings of the ACM-SIGMOD International Conference on the Management of Data, San Jose, California, May.
[11] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland.
[12] R. Srikant and R. Agrawal. Mining generalized association rules. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland, September.

Mining Association Rules with Item Constraints. Ramakrishnan Srikant and Quoc Vu and Rakesh Agrawal. IBM Almaden Research Center

Mining Association Rules with Item Constraints. Ramakrishnan Srikant and Quoc Vu and Rakesh Agrawal. IBM Almaden Research Center Mining Association Rules with Item Constraints Ramakrishnan Srikant and Quoc Vu and Rakesh Agrawal IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120, U.S.A. fsrikant,qvu,ragrawalg@almaden.ibm.com

More information

Discovery of Actionable Patterns in Databases: The Action Hierarchy Approach

Discovery of Actionable Patterns in Databases: The Action Hierarchy Approach Discovery of Actionable Patterns in Databases: The Action Hierarchy Approach Gediminas Adomavicius Computer Science Department Alexander Tuzhilin Leonard N. Stern School of Business Workinq Paper Series

More information

Discovery of Association Rules in Temporal Databases 1

Discovery of Association Rules in Temporal Databases 1 Discovery of Association Rules in Temporal Databases 1 Abdullah Uz Tansel 2 and Necip Fazil Ayan Department of Computer Engineering and Information Science Bilkent University 06533, Ankara, Turkey {atansel,

More information

Fast Discovery of Sequential Patterns Using Materialized Data Mining Views

Fast Discovery of Sequential Patterns Using Materialized Data Mining Views Fast Discovery of Sequential Patterns Using Materialized Data Mining Views Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo

More information

rule mining can be used to analyze the share price R 1 : When the prices of IBM and SUN go up, at 80% same day.

rule mining can be used to analyze the share price R 1 : When the prices of IBM and SUN go up, at 80% same day. Breaking the Barrier of Transactions: Mining Inter-Transaction Association Rules Anthony K. H. Tung 1 Hongjun Lu 2 Jiawei Han 1 Ling Feng 3 1 Simon Fraser University, British Columbia, Canada. fkhtung,hang@cs.sfu.ca

More information

Tadeusz Morzy, Maciej Zakrzewicz

Tadeusz Morzy, Maciej Zakrzewicz From: KDD-98 Proceedings. Copyright 998, AAAI (www.aaai.org). All rights reserved. Group Bitmap Index: A Structure for Association Rules Retrieval Tadeusz Morzy, Maciej Zakrzewicz Institute of Computing

More information

Item Set Extraction of Mining Association Rule

Item Set Extraction of Mining Association Rule Item Set Extraction of Mining Association Rule Shabana Yasmeen, Prof. P.Pradeep Kumar, A.Ranjith Kumar Department CSE, Vivekananda Institute of Technology and Science, Karimnagar, A.P, India Abstract:

More information

Data Access Paths for Frequent Itemsets Discovery

Data Access Paths for Frequent Itemsets Discovery Data Access Paths for Frequent Itemsets Discovery Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science {marekw, mzakrz}@cs.put.poznan.pl Abstract. A number

More information

Materialized Data Mining Views *

Materialized Data Mining Views * Materialized Data Mining Views * Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland tel. +48 61

More information

Ecient Parallel Data Mining for Association Rules. Jong Soo Park, Ming-Syan Chen and Philip S. Yu. IBM Thomas J. Watson Research Center

Ecient Parallel Data Mining for Association Rules. Jong Soo Park, Ming-Syan Chen and Philip S. Yu. IBM Thomas J. Watson Research Center Ecient Parallel Data Mining for Association Rules Jong Soo Park, Ming-Syan Chen and Philip S. Yu IBM Thomas J. Watson Research Center Yorktown Heights, New York 10598 jpark@cs.sungshin.ac.kr, fmschen,

More information

Income < 50k. low. high

Income < 50k. low. high A Multi-Tier Architecture for High-Performance Data Mining Ralf Rantzau and Holger Schwarz University of Stuttgart, Institute of Parallel and Distributed High-Performance Systems (IPVR), Breitwiesenstr.

More information

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set To Enhance Scalability of Item Transactions by Parallel and Partition using Dynamic Data Set Priyanka Soni, Research Scholar (CSE), MTRI, Bhopal, priyanka.soni379@gmail.com Dhirendra Kumar Jha, MTRI, Bhopal,

More information

Efficient integration of data mining techniques in DBMSs

Efficient integration of data mining techniques in DBMSs Efficient integration of data mining techniques in DBMSs Fadila Bentayeb Jérôme Darmont Cédric Udréa ERIC, University of Lyon 2 5 avenue Pierre Mendès-France 69676 Bron Cedex, FRANCE {bentayeb jdarmont

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Data Mining Query Scheduling for Apriori Common Counting

Data Mining Query Scheduling for Apriori Common Counting Data Mining Query Scheduling for Apriori Common Counting Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland {marek,

More information

SQL Based Frequent Pattern Mining with FP-growth

SQL Based Frequent Pattern Mining with FP-growth SQL Based Frequent Pattern Mining with FP-growth Shang Xuequn, Sattler Kai-Uwe, and Geist Ingolf Department of Computer Science University of Magdeburg P.O.BOX 4120, 39106 Magdeburg, Germany {shang, kus,

More information

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Virendra Kumar Shrivastava 1, Parveen Kumar 2, K. R. Pardasani 3 1 Department of Computer Science & Engineering, Singhania

More information

Discovery of Multi Dimensional Quantitative Closed Association Rules by Attributes Range Method

Discovery of Multi Dimensional Quantitative Closed Association Rules by Attributes Range Method Discovery of Multi Dimensional Quantitative Closed Association Rules by Attributes Range Method Preetham Kumar, Ananthanarayana V S Abstract In this paper we propose a novel algorithm for discovering multi

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach ABSTRACT G.Ravi Kumar 1 Dr.G.A. Ramachandra 2 G.Sunitha 3 1. Research Scholar, Department of Computer Science &Technology,

More information

Mining N-most Interesting Itemsets. Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang. fadafu,

Mining N-most Interesting Itemsets. Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang. fadafu, Mining N-most Interesting Itemsets Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang Department of Computer Science and Engineering The Chinese University of Hong Kong, Hong Kong fadafu, wwkwongg@cse.cuhk.edu.hk

More information

An Improved Apriori Algorithm for Association Rules

An Improved Apriori Algorithm for Association Rules Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan

More information

A New SQL-like Operator for Mining Association Rules. Rosa Meo. Dipartimento di Automatica e Informatica, Politecnico di Torino, Italy

A New SQL-like Operator for Mining Association Rules. Rosa Meo. Dipartimento di Automatica e Informatica, Politecnico di Torino, Italy A New SQL-like Operator for Mining Association Rules Rosa Meo Dipartimento di Automatica e Informatica, Politecnico di Torino, Italy rosimeo@polito.it Giuseppe Psaila Dipartimento di Automatica e Informatica,

More information

signicantly higher than it would be if items were placed at random into baskets. For example, we

signicantly higher than it would be if items were placed at random into baskets. For example, we 2 Association Rules and Frequent Itemsets The market-basket problem assumes we have some large number of items, e.g., \bread," \milk." Customers ll their market baskets with some subset of the items, and

More information

the probabilistic network subsystem

the probabilistic network subsystem Discovering Quasi-Equivalence Relationships from Database Systems Mei-Ling Shyu Shu-Ching Chen R. L. Kashyap School of Electrical and School of Computer Science School of Electrical and Computer Engineering

More information

A Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective

A Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective A Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective B.Manivannan Research Scholar, Dept. Computer Science, Dravidian University, Kuppam, Andhra Pradesh, India

More information

A Data Mining Framework for Extracting Product Sales Patterns in Retail Store Transactions Using Association Rules: A Case Study

A Data Mining Framework for Extracting Product Sales Patterns in Retail Store Transactions Using Association Rules: A Case Study A Data Mining Framework for Extracting Product Sales Patterns in Retail Store Transactions Using Association Rules: A Case Study Mirzaei.Afshin 1, Sheikh.Reza 2 1 Department of Industrial Engineering and

More information

Association Rules Mining:References

Association Rules Mining:References Association Rules Mining:References Zhou Shuigeng March 26, 2006 AR Mining References 1 References: Frequent-pattern Mining Methods R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm

More information

Discovering interesting rules from financial data

Discovering interesting rules from financial data Discovering interesting rules from financial data Przemysław Sołdacki Institute of Computer Science Warsaw University of Technology Ul. Andersa 13, 00-159 Warszawa Tel: +48 609129896 email: psoldack@ii.pw.edu.pl

More information

KNOWLEDGE DISCOVERY AND DATA MINING

KNOWLEDGE DISCOVERY AND DATA MINING KNOWLEDGE DISCOVERY AND DATA MINING Prof. Fabio A. Schreiber Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION MANAGEMENT TECHNOLOGIES DATA WAREHOUSE DECISION SUPPORT SYSTEMS

More information

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets American Journal of Applied Sciences 2 (5): 926-931, 2005 ISSN 1546-9239 Science Publications, 2005 Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets 1 Ravindra Patel, 2 S.S.

More information

Lecture 2 Wednesday, August 22, 2007

Lecture 2 Wednesday, August 22, 2007 CS 6604: Data Mining Fall 2007 Lecture 2 Wednesday, August 22, 2007 Lecture: Naren Ramakrishnan Scribe: Clifford Owens 1 Searching for Sets The canonical data mining problem is to search for frequent subsets

More information

Applying Objective Interestingness Measures. in Data Mining Systems. Robert J. Hilderman and Howard J. Hamilton. Department of Computer Science

Applying Objective Interestingness Measures. in Data Mining Systems. Robert J. Hilderman and Howard J. Hamilton. Department of Computer Science Applying Objective Interestingness Measures in Data Mining Systems Robert J. Hilderman and Howard J. Hamilton Department of Computer Science University of Regina Regina, Saskatchewan, Canada SS 0A fhilder,hamiltong@cs.uregina.ca

More information

Mining of Web Server Logs using Extended Apriori Algorithm

Mining of Web Server Logs using Extended Apriori Algorithm International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational

More information

Novel Materialized View Selection in a Multidimensional Database

Novel Materialized View Selection in a Multidimensional Database Graphic Era University From the SelectedWorks of vijay singh Winter February 10, 2009 Novel Materialized View Selection in a Multidimensional Database vijay singh Available at: https://works.bepress.com/vijaysingh/5/

More information

Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data

Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data Shilpa Department of Computer Science & Engineering Haryana College of Technology & Management, Kaithal, Haryana, India

More information

CS570 Introduction to Data Mining

CS570 Introduction to Data Mining CS570 Introduction to Data Mining Frequent Pattern Mining and Association Analysis Cengiz Gunay Partial slide credits: Li Xiong, Jiawei Han and Micheline Kamber George Kollios 1 Mining Frequent Patterns,

More information

A Literature Review of Modern Association Rule Mining Techniques

A Literature Review of Modern Association Rule Mining Techniques A Literature Review of Modern Association Rule Mining Techniques Rupa Rajoriya, Prof. Kailash Patidar Computer Science & engineering SSSIST Sehore, India rprajoriya21@gmail.com Abstract:-Data mining is

More information

New Orleans, Louisiana, February/March Knowledge Discovery from Telecommunication. Network Alarm Databases. K. Hatonen M. Klemettinen H.

New Orleans, Louisiana, February/March Knowledge Discovery from Telecommunication. Network Alarm Databases. K. Hatonen M. Klemettinen H. To appear in the 12th International Conference on Data Engineering (ICDE'96), New Orleans, Louisiana, February/March 1996. Knowledge Discovery from Telecommunication Network Alarm Databases K. Hatonen

More information

Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials *

Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials * Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials * Galina Bogdanova, Tsvetanka Georgieva Abstract: Association rules mining is one kind of data mining techniques

More information

Appropriate Item Partition for Improving the Mining Performance

Appropriate Item Partition for Improving the Mining Performance Appropriate Item Partition for Improving the Mining Performance Tzung-Pei Hong 1,2, Jheng-Nan Huang 1, Kawuu W. Lin 3 and Wen-Yang Lin 1 1 Department of Computer Science and Information Engineering National

More information

Hierarchical Online Mining for Associative Rules

Hierarchical Online Mining for Associative Rules Hierarchical Online Mining for Associative Rules Naresh Jotwani Dhirubhai Ambani Institute of Information & Communication Technology Gandhinagar 382009 INDIA naresh_jotwani@da-iict.org Abstract Mining

More information

arxiv: v1 [cs.db] 10 May 2007

arxiv: v1 [cs.db] 10 May 2007 Decision tree modeling with relational views Fadila Bentayeb and Jérôme Darmont arxiv:0705.1455v1 [cs.db] 10 May 2007 ERIC Université Lumière Lyon 2 5 avenue Pierre Mendès-France 69676 Bron Cedex France

More information

Chapter 4 Data Mining A Short Introduction

Chapter 4 Data Mining A Short Introduction Chapter 4 Data Mining A Short Introduction Data Mining - 1 1 Today's Question 1. Data Mining Overview 2. Association Rule Mining 3. Clustering 4. Classification Data Mining - 2 2 1. Data Mining Overview

More information

ESTIMATING HASH-TREE SIZES IN CONCURRENT PROCESSING OF FREQUENT ITEMSET QUERIES

ESTIMATING HASH-TREE SIZES IN CONCURRENT PROCESSING OF FREQUENT ITEMSET QUERIES ESTIMATING HASH-TREE SIZES IN CONCURRENT PROCESSING OF FREQUENT ITEMSET QUERIES Pawel BOINSKI, Konrad JOZWIAK, Marek WOJCIECHOWSKI, Maciej ZAKRZEWICZ Institute of Computing Science, Poznan University of

More information

On Multiple Query Optimization in Data Mining

On Multiple Query Optimization in Data Mining On Multiple Query Optimization in Data Mining Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland {marek,mzakrz}@cs.put.poznan.pl

More information

size, runs an existing induction algorithm on the rst subset to obtain a rst set of rules, and then processes each of the remaining data subsets at a

size, runs an existing induction algorithm on the rst subset to obtain a rst set of rules, and then processes each of the remaining data subsets at a Multi-Layer Incremental Induction Xindong Wu and William H.W. Lo School of Computer Science and Software Ebgineering Monash University 900 Dandenong Road Melbourne, VIC 3145, Australia Email: xindong@computer.org

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

Computing Data Cubes Using Massively Parallel Processors

Computing Data Cubes Using Massively Parallel Processors Computing Data Cubes Using Massively Parallel Processors Hongjun Lu Xiaohui Huang Zhixian Li {luhj,huangxia,lizhixia}@iscs.nus.edu.sg Department of Information Systems and Computer Science National University

More information

An Ecient Algorithm for Mining Association Rules in Large. Databases. Ashok Savasere Edward Omiecinski Shamkant Navathe. College of Computing

An Ecient Algorithm for Mining Association Rules in Large. Databases. Ashok Savasere Edward Omiecinski Shamkant Navathe. College of Computing An Ecient Algorithm for Mining Association Rules in Large Databases Ashok Savasere Edward Omiecinski Shamkant Navathe College of Computing Georgia Institute of Technology Atlanta, GA 3332 e-mail: fashok,edwardo,shamg@cc.gatech.edu

More information

Performance Improvements. IBM Almaden Research Center. Abstract. The problem of mining sequential patterns was recently introduced

Performance Improvements. IBM Almaden Research Center. Abstract. The problem of mining sequential patterns was recently introduced Mining Sequential Patterns: Generalizations and Performance Improvements Ramakrishnan Srikant? and Rakesh Agrawal fsrikant, ragrawalg@almaden.ibm.com IBM Almaden Research Center 650 Harry Road, San Jose,

More information

1 Introduction The step following the data mining step in the KDD process consists of the interpretation of the data mining results [6]. This post-pro

1 Introduction The step following the data mining step in the KDD process consists of the interpretation of the data mining results [6]. This post-pro Decision support queries for the interpretation of data mining results Bart Goethals, Jan Van den Bussche, Koen Vanhoof Limburgs Universitair Centrum Abstract The interpretation of data mining results

More information

A mining method for tracking changes in temporal association rules from an encoded database

A mining method for tracking changes in temporal association rules from an encoded database A mining method for tracking changes in temporal association rules from an encoded database Chelliah Balasubramanian *, Karuppaswamy Duraiswamy ** K.S.Rangasamy College of Technology, Tiruchengode, Tamil

More information

Association Rule Mining from XML Data

Association Rule Mining from XML Data 144 Conference on Data Mining DMIN'06 Association Rule Mining from XML Data Qin Ding and Gnanasekaran Sundarraj Computer Science Program The Pennsylvania State University at Harrisburg Middletown, PA 17057,

More information

Data warehouses Decision support The multidimensional model OLAP queries

Data warehouses Decision support The multidimensional model OLAP queries Data warehouses Decision support The multidimensional model OLAP queries Traditional DBMSs are used by organizations for maintaining data to record day to day operations On-line Transaction Processing

More information

Value Added Association Rules

Value Added Association Rules Value Added Association Rules T.Y. Lin San Jose State University drlin@sjsu.edu Glossary Association Rule Mining A Association Rule Mining is an exploratory learning task to discover some hidden, dependency

More information

A Mining Algorithm to Generate the Candidate Pattern for Authorship Attribution for Filtering Spam Mail

A Mining Algorithm to Generate the Candidate Pattern for Authorship Attribution for Filtering Spam Mail A Mining Algorithm to Generate the Candidate Pattern for Authorship Attribution for Filtering Spam Mail Khongbantabam Susila Devi #1, Dr. R. Ravi *2 1 Research Scholar, Department of Information & Communication

More information

Association Rules Extraction with MINE RULE Operator

Association Rules Extraction with MINE RULE Operator Association Rules Extraction with MINE RULE Operator Marco Botta, Rosa Meo, Cinzia Malangone 1 Introduction In this document, the algorithms adopted for the implementation of the MINE RULE core operator

More information
