Data Mining in Data Warehouses

Elena Baralis (1), Rosa Meo (1), Giuseppe Psaila (2)

(1) Politecnico di Torino, Dipartimento di Automatica e Informatica, Corso Duca degli Abruzzi, 24 - I Torino, Italy
(2) Dipartimento di Elettronica e Informazione, Politecnico di Milano, P.za Leonardo da Vinci Milano, Italy

baralis/rosimeo@polito.it, psaila@elet.polimi.it

Abstract

Data warehouses provide an integrated environment where huge amounts of data extracted from operational sources are available for various kinds of decision support analysis. Hence, to allow the user to improve the quality of the analysis performed, it is becoming fundamentally important to effectively integrate mining capabilities and data warehousing technology. This paper describes AMORE-DW, an integrated environment for the specification of data mining requests and the extraction of association rules from a data warehouse. The adopted architecture is characterized by a tight coupling of data mining with the relational OLAP (ROLAP) server on the data warehouse, which provides efficient access to the data to be analyzed. The main issues faced during the design are presented and the trade-off between flexible data analysis and system performance is discussed.

1 Introduction

The availability of efficient and reliable database technology allows the massive and systematic gathering of huge amounts of operational data concerning every kind of human activity, such as business transactions or scientific research. Before being analyzed, the raw data, possibly extracted from heterogeneous sources, need to be properly integrated and carefully cleaned, in order to allow the extraction of reliable information. Furthermore, data analysis algorithms perform complex (and expensive) operations on data, which are best performed in a separate environment to avoid hampering daily operational data processing.
Data warehousing technology [6], which has experienced explosive growth in the past few years, is able to provide an integrated environment where data extracted from operational sources are available for different types of decision support analysis. Hence, data warehouses are expected to naturally become a major platform for data mining [7].

To extract useful information that may be exploited, e.g., in business decision making, the data stored in the data warehouse must be analyzed with appropriate techniques. While OLAP (On Line Analytical Processing) analysis is devoted to the computation of complex aggregations on the data, data mining is focused on the extraction of the most frequent regularities that characterize the data. Such regularities are described by means of specific models, which give a more abstract description of the data.

An important class of data mining problems is represented by means of association rules. Association rules describe the most common links among data items in a large amount of collected data. The "classical" example application of association rule discovery [1] is the analysis of data recording customer purchases at supermarket checkouts. In this context, association rules describe which products are likely to be bought together by a statistically relevant number of customers. The discovered information can then be used by the store management to support strategic decisions, e.g., for planning and marketing.

In general, an association rule is characterized by the structure $X \Rightarrow Y$, where X, the rule body, and Y, the rule head, are two sets of values drawn from the mined attribute, i.e., the attribute whose behavior is observed (e.g., bought items in the above example). To perform rule extraction, data are grouped by some attribute (e.g., customer transactions); rules describe regularities of the mined attribute with respect to the groups. The relevance of a rule is expressed in terms of the frequency with which the rule is observed in the analyzed data. Thus, two probability measures, called support and confidence, are associated with a rule. The support is the joint probability of finding X and Y in the same group. The confidence is the conditional probability of finding Y in a group, having found X. Two minimum thresholds for support and confidence are defined in order to discard the less frequent associations in the database.

This work has been supported by the Interdata MURST grant.

The integration of data mining techniques with data warehousing technology will enhance both the data analysis capabilities provided by current data warehouse products and the expressive power and flexibility of the analysis performed by current data mining tools. In fact, current commercial ROLAP (Relational OLAP) servers provide both the powerful data retrieval services of relational DBMS servers and ad-hoc OLAP optimization techniques. This in turn allows data analysts to specify complex search criteria (more refined than current data mining tools would allow) in order to extract more useful knowledge from the raw data stored in the warehouse.

These considerations inspired the AMORE-DW (Advanced Mining On Relational Environments - Data Warehousing) project. The main goal is the development of a mining tool tightly integrated with the data warehouse and its ROLAP server, such that the source data are the data collected in the data warehouse, and the extracted rules are represented as database relations.
In this context, mining requests are described by means of an SQL-like language, which allows a flexible specification of mining statements and extends the semantics of other languages [8]; this operator also provides specific constructs to deal with the data schema typical of data warehouses (known as star schema), in order to simplify the specification of the mining request and to allow the data mining tool to perform appropriate optimizations that reduce the cost of the analysis.

In this paper we illustrate the main features of the AMORE-DW environment and we discuss the design issues and the problems encountered in integrating data mining and data warehousing technology. In particular, in Section 2 we describe a declarative language for the specification of mining requests, based on the SQL-like operator MINE RULE, while in Section 3 we present the architecture of the AMORE-DW prototype. Section 4 discusses the trade-off between flexible data analysis and system performance when defining which tasks of the rule extraction process are to be performed by the OLAP server or by special-purpose extraction algorithms. Section 5 draws conclusions.

2 Specification Language for Association Rules Extraction

Several algorithms have been proposed to extract specific types of association rules (e.g., "classical" association rules [1], generalized association rules [12], ...), which operate on fixed-format data and are tailored to the specific type of association rules to be searched. We propose a general-purpose, SQL-like language to declaratively specify the features of the association rules to be extracted from the data warehouse. Since this language is not bound to any specific data schema, it allows the user to freely search through the whole schema of the data warehouse. Furthermore, it allows the specification of complex extraction criteria, which are not available in traditional association rule extraction prototypes.
Thus, the user is able to restrict the search space by progressively refining the characteristics of the association rules to be extracted. An initial description of MINE RULE, the main operator of the language, can be found in [9]; it is extended here to cope with the specific features of data warehouses. The operator is introduced by means of an example, which specifies several complex extraction criteria. The example is based on the data warehouse schema, describing sales in a supermarket, which is represented in Figure 1.

Customers (cust id, name, address, birth year, city, city id, region, region id, ...)
Sales (cust id, product id, day id, price, qty)
Time (day id, day, month, year, ...)
Products (product id, product, subcategory id, subcategory, category id, category, ...)

Figure 1: Star schema of the supermarket data warehouse; bold attribute names represent the primary key of each table.

[Figure 2: The Sales fact table: (a) simplified instance, with columns cust-id, product-id, day-id, price, qty; (b) the same instance grouped by cust-id and clustered by day-id.]

Since data warehouse schemas usually have a star topology, they are called star schemas. In this example, the star center, called fact table, describes sales figures for a supermarket. In particular, each purchase is performed by a specific customer (supposing the store uses customer cards to identify its customers) to buy a specific product (identified, e.g., by means of the bar code) on a specific date. These are the main dimensions along which each sale event is characterized. Further information on each dimension can be obtained from the appropriate dimension table, which describes in more detail the features of customers (e.g., demographic information, complete address, ...), products (e.g., merchandise hierarchy, given by product sub-category and category, and other attributes, such as package type, and many more), and time (e.g., time hierarchy, given by day, week or month, year, and other attributes useful for tracking sales, such as holiday and special event indication, ...). Each sale fact in the Sales table is further characterized by two measures: the price and the quantity of the sold product.

We finally observe that non-numerical attributes (e.g., category in table Products) are included in the dimension tables also in encoded format (see footnote 1).
This format is usually used in data processing instead of the original one for efficiency reasons.

For the sake of simplicity, to illustrate the expressive power of the MINE RULE operator, we consider the reduced instance of the Sales table presented in Figure 2(a). Suppose we want to extract association rules with the following features:

a) Rules describe the behavior of customers in terms of the sets of products most frequently purchased by them.
b) Only customers born after 1970 that purchased at least two products are considered.

Footnote 1: The attributes are encoded during the loading process and the periodic refresh of the data warehouse. Note that dimensions are seldom updated during the periodic refresh of the data warehouse.

c) Products appearing in the body must be purchased on the same date, after October. Products in the head are purchased on the same date, but after the products in the corresponding body.
d) Products in the body have a price less than or equal to 200$, whereas products in the head cost less than products in the body.
e) Rules are interesting only if their support is at least 20% and their confidence is at least 30% (see footnote 2).

The following statement allows the extraction of association rules corresponding to the above specification.

    MINE RULE YoungCustomers AS
    SELECT DISTINCT 1..n product-id->product AS BODY,
                    1..n product-id->product AS HEAD,
                    SUPPORT, CONFIDENCE
    WHERE BODY.price <= 200 AND HEAD.price < BODY.price
    FROM Sales
    GROUP BY cust-id
      HAVING COUNT(*) >= 2 AND cust-id->birth-year > 1970
    CLUSTER BY day-id
      HAVING BODY.day-id->day < HEAD.day-id->day AND BODY.day-id->month > 10
    EXTRACTING RULES WITH SUPPORT: 0.2, CONFIDENCE: 0.3

The association rules are extracted by performing the following steps:

Data Source. The FROM clause specifies the source data to analyze. Only the fact table needs to be specified. To reference attributes in the dimension tables, the referencing "->" operator is used to follow the foreign key constraint from the fact table to the dimension table. For example, the attribute birth-year of the Customers dimension is reached through the attribute cust-id. Observe that the analysis always involves the data contained in the fact table; the dimensions may be referenced as well, but never in the absence of the fact table.

Group computation. The GROUP BY clause specifies that the source relation Sales is logically partitioned into groups of tuples having the same value for the grouping attribute cust-id (corresponding to feature (a) above).

Group filtering. The (optional) HAVING clause associated with the GROUP BY clause discards, before rule extraction, all groups with fewer than two tuples or whose customer is born before 1970 (corresponding to feature (b) above).
Cluster identification. The (optional) CLUSTER BY clause further partitions each group into sub-groups called clusters, such that tuples in a cluster have the same value for the clustering attribute day-id. The result of both grouping and clustering of the data instance in Figure 2(a) is represented in Figure 2(b). When clustering is specified, the body of a rule is extracted from (smaller) clusters instead of entire groups, and analogously for rule heads. Thus elements in the body (and head) share the same value of the clustering attribute (corresponding to feature (c) above).

Cluster coupling. To compose rules, every pair of clusters (one for the body and one for the head) inside the same group is considered. Furthermore, the optional HAVING clause of the CLUSTER BY clause selects the cluster pairs that should be considered for extracting rules. In this case, a pair of clusters is considered only if the day of the left-hand cluster, the body cluster, precedes the day of the right-hand cluster, the head cluster (corresponding to feature (c) above).

Footnote 2: These figures are meaningful only for the example data warehouse; in real applications they would be significantly lower.

Rule extraction. From each group, the SELECT clause extracts all possible associations of an unlimited set of products (clause 1..n product-id->product AS BODY), representing the body of rules, with another set of products (clause 1..n product-id->product AS HEAD), representing the head of rules (corresponding to feature (a) above).

Mining condition. The (optional) WHERE clause following the SELECT clause forces rule extraction to consider only pairs of tuples, the first one (called body tuple) coming from the body cluster and the second one (called head tuple) coming from the head cluster, such that the value of attribute price in the body tuple is less than or equal to 200, and it is higher than the value of attribute price in the head tuple (corresponding to feature (d) above).

Support and confidence evaluation. The support of a rule is the number of groups from which the rule is extracted divided by the total number of groups generated by the GROUP BY clause. The confidence is the number of groups from which the rule is extracted divided by the number of groups that contain the body in some cluster. When support or confidence is lower than the respective minimum threshold (20/100 = 0.2 for support and 30/100 = 0.3 for confidence in our sample statement, see feature (e)), the rule is discarded.

This is the rule set extracted from the Sales table instance presented in Figure 2(a):

    Rule                                      Support   Confidence
    {brown boots} ⇒ {col shirts}
    {jackets} ⇒ {col shirts}
    {brown boots, jackets} ⇒ {col shirts}

where brown boots corresponds to product identifier p-3, jackets to p-4, col shirts to p-5. Since both rule bodies and heads are sets, they can be represented as relational set attributes, similarly to the forthcoming SQL3 standard.

The statement described above specifies very complex extraction criteria and exploits most of the powerful features provided by the MINE RULE operator.
Simpler extraction criteria are normally specified in the initial phase of the search process; mining requests are then progressively refined by adding more complex filtering conditions. Rules can be divided into several (orthogonal) classes, depending on the extraction criteria that guide the extraction process:

Simple Association Rules. Only the basic (mandatory) extraction criteria are specified (source data, grouping attribute, mined attribute, body and head cardinality, minimum support and confidence). This class corresponds to classical association rule extraction [1].

Filtered Association Rules. A group filtering condition is specified: rules are extracted only from a selected subset of groups.

Mining-Constrained Association Rules. A mining condition expresses a complex correlation between rule body and rule head.

Clustering Association Rules. The clustering attributes, and possibly the cluster selection predicate, are specified. Bodies and heads of the rules are extracted from clusters instead of entire groups.

These classes are not mutually exclusive. Hence, mining problems can be expressed as a combination of the described extraction criteria.
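The extraction steps of this section can be illustrated with a minimal Python sketch. It is not the AMORE-DW implementation: it assumes a tiny in-memory table with made-up values (not the instance of Figure 2), handles only rules with one item in the body and one in the head, applies the group filter on the tuple count only, and counts as "groups containing the body" the filtered groups in which the body item appears in some cluster. All names are hypothetical.

```python
from collections import defaultdict
from itertools import product as pairs

# Hypothetical Sales tuples: (cust_id, item, day, price). Values are made up.
SALES = [
    ("c1", "boots", 1, 150), ("c1", "shirt", 2, 40),
    ("c2", "boots", 1, 150), ("c2", "jacket", 1, 180), ("c2", "shirt", 3, 40),
    ("c3", "jacket", 2, 180),
]


def elementary_rules(sales, min_support, min_confidence):
    """Return {(body_item, head_item): (support, confidence)}."""
    # Group computation (GROUP BY cust_id) and cluster identification
    # (CLUSTER BY day): groups[cust][day] is one cluster.
    groups = defaultdict(lambda: defaultdict(list))
    for cust, item, day, price in sales:
        groups[cust][day].append((item, price))
    total_groups = len(groups)  # denominator of support

    # Group filtering: HAVING COUNT(*) >= 2 (assumed simplification).
    valid = {cust: clusters for cust, clusters in groups.items()
             if sum(len(rows) for rows in clusters.values()) >= 2}

    rule_groups = defaultdict(set)  # rule -> groups from which it is extracted
    body_groups = defaultdict(set)  # item -> groups containing it in a cluster
    for cust, clusters in valid.items():
        for rows in clusters.values():
            for item, _ in rows:
                body_groups[item].add(cust)
        # Cluster coupling: the body cluster's day precedes the head cluster's.
        for body_day, head_day in pairs(clusters, repeat=2):
            if not body_day < head_day:
                continue
            # Mining condition: BODY.price <= 200 AND HEAD.price < BODY.price.
            for (b_item, b_price), (h_item, h_price) in pairs(
                    clusters[body_day], clusters[head_day]):
                if b_price <= 200 and h_price < b_price:
                    rule_groups[(b_item, h_item)].add(cust)

    # Support and confidence evaluation, then threshold filtering.
    rules = {}
    for (body, head), found_in in rule_groups.items():
        support = len(found_in) / total_groups
        confidence = len(found_in) / len(body_groups[body])
        if support >= min_support and confidence >= min_confidence:
            rules[(body, head)] = (support, confidence)
    return rules


if __name__ == "__main__":
    for rule, measures in elementary_rules(SALES, 0.5, 0.3).items():
        print(rule, measures)
```

Counting distinct groups rather than tuples mirrors the definitions above: a rule found several times inside one customer's purchases still contributes a single group to its support.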

3 AMORE-DW Architecture

[Figure 3: Mining Server architecture. Components: Translator, Mining Kernel, ROLAP SQL Server; input: Mining Statement; output: Extracted Rules.]

The AMORE-DW prototype is based on a client-server architecture. On the client side, by means of a suite of user-friendly interface tools, the user specifies mining requests, which are then automatically mapped to statements in the language for the extraction of association rules described in Section 2. Then, the mining statement is submitted to the mining server, which performs the rule extraction. Finally, on the client side, the user may browse the result of the extraction process.

The mining server, which is in charge of the actual rule extraction, is tightly coupled with the ROLAP server, in order to exploit its powerful data manipulation language and its data storage and access facilities. Typically, both servers (the mining server and the ROLAP server) should reside on the same system, which is usually devoted to running computationally expensive analysis jobs. In the following, we focus on the architecture of the mining server, presenting in more detail all its components and their interaction.

3.1 Mining Server Architecture

The architecture of the Mining Server is depicted in Figure 3. It encompasses the following components:

The Translator interprets the MINE RULE mining statement: it checks the correctness of the request and generates the processing directives for the Mining Kernel.

The Mining Kernel is the specialized component for data analysis and association rule extraction.

The ROLAP Server prepares the data for the analysis according to the Mining Kernel's directives, provides efficient access to them and stores the analysis results.

When a mining request is received by the Mining Server, the following processing steps take place (see Figure 3, where solid arrows indicate the information flow between the components of the system):

1. The Translator performs the syntactic and semantic verification of the statement's correctness. It checks the definition of table and attribute names referenced in the statement against the ROLAP server Data Dictionary. This information flow is represented in Figure 3 by the edge connecting the ROLAP Server to the Translator.

2. The Translator extracts the features characterizing the submitted statement and generates processing directives for the Mining Kernel. In the figure, the processing directives sent to the Mining Kernel are represented by the edge from the Translator to the Mining Kernel.

3. The Mining Kernel is activated upon receiving the processing directives from the Translator. It extracts association rules by performing the following tasks:
   (a) It instructs the ROLAP Server to preprocess the data (see Section 3.3.1).
   (b) It reads the data and composes association rules (see Section 3.3.2).
   (c) It stores the extracted rules into the data warehouse.
   In the figure, the bidirectional edge between the Mining Kernel and the ROLAP Server denotes the exchange of information in both directions.

At this point, the extraction process is completed and the user can browse the obtained association rules on the ROLAP Server.

An important issue in the implementation of the Mining Server is the identification of the border between typical data processing tasks, to be executed by the ROLAP server, and mining processing, performed by specialized algorithms in the Mining Kernel. At one extreme, the ROLAP server could be used only to retrieve raw data, as in the traditional approach. In this case, the mining algorithm would be overloaded by tasks such as the evaluation of complex SQL predicates on raw data. At the opposite extreme, the entire rule extraction could be performed by means of SQL programs executed by the ROLAP server. This solution is not efficient, since excessively frequent context switching between the SQL context and the mining application context would occur [3]. In particular, to decide which tasks are better performed by each server, we considered the following issues:

The presence of the reference operator -> in MINE RULE statements requires some preprocessing to join the data referenced by means of the -> operator. This task is efficiently performed by the ROLAP Server.

SQL predicates allowed by the MINE RULE operator are effectively evaluated by the ROLAP Server without overloading the mining algorithm.
The actual association rule computation is an iterative process [1, 4, 11] that is better performed in main memory with suitable data structures by a specialized algorithm.

Hence, many of the extraction features that characterize the MINE RULE operator can be efficiently delegated to the ROLAP server. The pool of operations performed by the ROLAP server is embedded into an SQL package called preprocessor (presented in Section 3.3.1). The actual rule discovery process is carried out by a specialized algorithm in the Mining Kernel, which is called the core operator (presented in Section 3.3.2).

3.2 Translator

The translator is in charge of the following tasks:

It executes all lexical, syntactic and semantic checks on the mining statement. Semantic checks range from the simple verification of correctness of the statement (e.g., the source specification should reference tables existing in the database dictionary), to the enforcement of constraints on the mining statement (e.g., the set of attributes in the GROUP BY clause should be disjoint from the set of attributes in the other clauses, such as the CLUSTER BY or the SELECT clause).

It maps all non-numerical attributes (e.g., strings) referenced by the mining statement in non-conditional clauses into the corresponding encoded version (see footnote 3). Non-numerical attribute encoding allows a significant efficiency improvement for both the preprocessor and the core operator.

Footnote 3: The correspondence is permanently stored in appropriate meta-data tables. Recall that the values of both the original attribute and its encoded version are stored in the data warehouse.

It analyzes the syntactic clauses in the MINE RULE statement, determining the class of mining statements to which the current statement belongs, in order to generate the processing directives for the Mining Kernel. MINE RULE statements are classified according to the following categories:

Basic statements: all the mandatory clauses are included (MINE RULE, SELECT, FROM, GROUP BY, EXTRACTING RULES WITH). Optionally, the selection of source data (WHERE predicate in the FROM clause) and the selection of relevant groups (HAVING predicate following the GROUP BY clause) may be specified. An arbitrary cardinality is allowed both in body and head.

Complex statements: optional features, such as mining condition, clustering attribute(s) (CLUSTER BY clause) and cluster selection predicates (HAVING predicate following the CLUSTER BY clause), are specified.

Finally, the translator calls the Mining Kernel, passing to it all the information required for further processing.

3.3 Mining Kernel

The Mining Kernel performs the extraction of association rules. Its structure can be divided into two main blocks:

The Preprocessor, which performs raw data preprocessing and yields data in a suitable format for the rule extraction algorithms in the core operator.

The Core Operator, which performs the actual rule extraction.

3.3.1 Preprocessor

Depending on the class of mining statements specified by the user, the preprocessor executes distinct SQL programs whose execution provides the core operator with the appropriate set of input data. These programs include database instructions that retrieve source data, evaluate selection predicates and association conditions, and prepare data in the appropriate format for the core operator. This component heavily exploits the services of the ROLAP Server and the specific data warehouse operating context.
For example, it takes advantage of the available mechanisms to disable continuous logging of database operations, which is costly and useless in this operating context.

The structure of the preprocessor is depicted in Figure 4. In the figure, ovals indicate SQL programs, while rectangles denote views or temporary tables. Directed edges show the processing flow: arcs entering a program denote its input tables, arcs exiting a program denote its output tables. Labeled arcs denote that the corresponding operation is executed only if the associated feature appears in the mining statement (a legend is reported in Figure 4); the vertical bar (|) separating two labels denotes their disjunction. Depending on the class of mining statements, the Preprocessor executes the following operations:

Basic statements. The preprocessor retrieves source data by means of program Q0; its result, named Source, may be a view or a temporary table containing the result of joining the fact and the dimension tables. Then, program Q1 counts the total number of groups in Source (needed by the Core Operator to compute rule support). If a group filtering condition is specified, program Q2, which selects the set of groups that satisfy the condition, is also executed. Finally, program Q3 generates the input for the Core Operator, named CoreSource. While in the simplest case CoreSource is a view, it may be temporarily materialized when the Core Operator needs to scan it several times. CoreSource includes only the tuples that must be inspected in the extraction process.
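As a rough illustration of the basic-statement path, the following Python sketch runs hypothetical counterparts of programs Q1, Q2 and Q3 against an in-memory SQLite database. The SQL actually generated by the AMORE-DW preprocessor is not given in the text: the table layouts, the column names and the queries themselves are assumptions, Source is loaded directly instead of being produced by Q0, and SQLite merely stands in for the ROLAP server.

```python
import sqlite3


def preprocess_basic(rows, min_group_size):
    """Hypothetical stand-ins for programs Q1, Q2 and Q3 (basic statements).

    rows: iterable of (cust_id, product_id) tuples playing the role of Source.
    Returns (totg, core_source_rows).
    """
    con = sqlite3.connect(":memory:")
    # Source would be the output of Q0 (fact table joined with dimensions).
    con.execute("CREATE TABLE Source (cust_id TEXT, product_id TEXT)")
    con.executemany("INSERT INTO Source VALUES (?, ?)", rows)

    # Q1: count the total number of groups (needed later for rule support).
    (totg,) = con.execute(
        "SELECT COUNT(DISTINCT cust_id) FROM Source").fetchone()

    # Q2: select the groups that satisfy the group filtering condition
    # (assumed here to be HAVING COUNT(*) >= min_group_size).
    con.execute("CREATE TABLE ValidGroups (cust_id TEXT)")
    con.execute(
        "INSERT INTO ValidGroups "
        "SELECT cust_id FROM Source GROUP BY cust_id HAVING COUNT(*) >= ?",
        (min_group_size,),
    )

    # Q3: CoreSource as a view holding only the tuples to be inspected.
    con.execute(
        "CREATE VIEW CoreSource AS "
        "SELECT DISTINCT s.cust_id, s.product_id "
        "FROM Source AS s JOIN ValidGroups AS v ON v.cust_id = s.cust_id"
    )
    core = con.execute(
        "SELECT cust_id, product_id FROM CoreSource "
        "ORDER BY cust_id, product_id").fetchall()
    con.close()
    return totg, core


if __name__ == "__main__":
    demo = [("c1", "p1"), ("c1", "p2"), ("c2", "p3"), ("c3", "p1")]
    print(preprocess_basic(demo, 2))
```

Note that the total group count is taken before group filtering, so that support is still measured against all groups produced by the GROUP BY clause.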

Complex statements. In this case, since the Preprocessor is in charge of the evaluation of all complex SQL predicates on data, it generates the so-called elementary rules, i.e., basic associations of one item for the body and one item for the head. Hence, some further processing steps are required with respect to the former case. In particular, if either a mining condition is specified, or clustering attributes and possibly a cluster condition are specified, program Q5 reads view CoreSource and creates the elementary rules, storing them into the AllRules temporary table, together with the list of groups containing each rule. In the presence of aggregate functions in the cluster condition, the execution of program Q5 is preceded by the execution of program Q4, which produces table Clusters, where each cluster is associated with the corresponding value of the aggregate functions. This table is then used to compose the elementary rules in program Q5. Finally, the set of elementary rules having support higher than or equal to the minimum support threshold (the large rules) is computed and stored in temporary table InputRules, which is the input of the Core Operator. In particular, program Q6 counts the support of each rule and detects the large rules, while program Q7 prepares table InputRules.

Summarizing, the preprocessor produces two types of input data for the core operator:

The Source Data. The core operator performs its analysis by reading only the tuples in the CoreSource table, i.e., the actual tuples to be considered in the extraction process. It is unaware of the actual origin of the source data.

The Large Elementary Rules. These are the basic associations of one item for the body and one item for the head having support higher than or equal to the minimum support threshold, which are stored in temporary table InputRules.
They are produced by the evaluation of the association conditions in the mining statement (e.g., the cluster or the mining condition).

3.3.2 Core Operator

The data processing tasks performed by the preprocessor allow the core operator to be independent of the selection conditions specified in the MINE RULE statement, e.g., HAVING conditions on clusters. This independence is essential for the simplicity and efficiency of the implementation of the extraction algorithms.

The features provided by our operator extend the semantics of association rules with respect to other SQL-like operators proposed in the literature [8]. This extended semantics requires the core operator to be implemented with original solutions, which need several adaptations of the well-known algorithms proposed in the literature [1, 4, 11]. The Mining Kernel includes several algorithms, each one specifically tailored to one of the categories of MINE RULE statements outlined above.

Basic statements. In this case, the extraction process corresponds to the traditional rule extraction performed by the algorithms described in [1, 4, 11, 10]. Rule discovery is performed by initially building sets of items with sufficient support (called large itemsets), and then creating rules from the large itemsets. All these algorithms are based on the observation that the number of itemsets grows exponentially with the number of items. Since a counter in main memory is needed for each itemset whose support is computed, these itemsets are carefully selected and their number is kept as low as possible.

The current version of the core operator for basic statements is based on the algorithm presented in [11]. However, the modularity of our architecture allows us to easily replace this algorithm with any of the algorithms proposed in the literature (see [1, 2, 4, 11], etc.). The algorithm operates in two passes.
In the first pass, the source data are divided into partitions of the same size, designed so that the counters for the itemsets can be stored in main memory. The search for large itemsets is performed separately in each partition. This step is based on

[Figure 4: Preprocessor Architecture. Labels on the arcs denote when they are enabled; the meaning of the labels is reported in the legend. Programs: Q0 (create Source), Q1 (count total groups), Q2 (create ValidGroups), Q3 (create CoreSource), Q4 (compute aggregate functions), Q5 (create elementary rules), Q6 (compute large rules), Q7 (select large rules). Tables and views: base tables, Source, ValidGroups, totg, CoreSource, Clusters, AllRules, LargeRules, InputRules. Legend: G = presence of the group filtering condition; M = presence of the mining condition; H = different attribute schema for body and head; C = clustered statement; F = aggregate functions in the cluster filtering condition.]

the observation that sufficient support in at least one partition is a necessary condition for an itemset to have sufficient support with respect to the whole source data as well. The itemsets that have sufficient support within a partition are saved on disk. The second pass computes the effective support of the saved itemsets, by counting the number of groups in the entire database that contain each itemset. The actual support is given by the ratio between this number and the total number of groups (computed by the preprocessor).

The actual rule discovery process considers each large itemset previously generated and extracts subsets of items. Denoting by $L$ a large itemset and by $H \subset L$ a subset, the rule $(L - H) \Rightarrow H$ is formed. Its support is defined as the itemset support $\mathrm{supp}(L)$; its confidence is immediately computed as $\mathrm{supp}(L) / \mathrm{supp}(L - H)$.

Complex statements. In this case the core operator receives from the ROLAP server the elementary rules (see footnote 4) from which rules can be extracted, instead of computing them itself. The algorithm discovers rules starting from formerly generated rules, as opposed to considering whole itemsets as in the previous algorithm. Rule discovery is performed in steps, progressively increasing at each step the cardinality of body and head, and computing rule support and confidence. At each step new rules are created by combining two rules found in the previous step. At first, two rules with matching heads are combined: the new rule has the same head and its body is obtained by the union of the two bodies. In this way the cardinality of the rule body is increased. Afterwards, new rules are created by combining the heads, thus increasing head cardinality. Finally, the algorithm scans the CoreSource table to compute the support of bodies, in order to calculate the confidence of the extracted rules.
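The two-pass scheme for basic statements and the derivation of rules from large itemsets can be sketched in Python as follows. This is a simplified illustration in the spirit of the partition-based approach, not the algorithm of [11] nor the AMORE-DW core operator: local itemsets are enumerated by brute force up to an assumed maximum size, partitions are plain list slices kept in memory, and all function and parameter names are hypothetical.

```python
from itertools import combinations


def local_large_itemsets(partition, min_support, max_size):
    """Pass 1 helper: itemsets with sufficient support inside one partition."""
    counts = {}
    for items in partition:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(items), size):
                counts[itemset] = counts.get(itemset, 0) + 1
    return {s for s, n in counts.items() if n / len(partition) >= min_support}


def mine(groups, min_support, min_confidence, n_partitions=2, max_size=3):
    """Return rules as (body, head, support, confidence) tuples.

    groups: non-empty list of item collections, one per group.
    """
    groups = [frozenset(g) for g in groups]

    # Pass 1: search each partition separately; an itemset that is large
    # overall must be large in at least one partition.
    step = -(-len(groups) // n_partitions)  # ceiling division
    candidates = set()
    for start in range(0, len(groups), step):
        candidates |= local_large_itemsets(
            groups[start:start + step], min_support, max_size)

    # Pass 2: compute the actual support of the saved itemsets over all groups.
    support = {}
    for itemset in candidates:
        count = sum(1 for g in groups if g.issuperset(itemset))
        if count / len(groups) >= min_support:
            support[itemset] = count / len(groups)

    # Rule discovery: for a large itemset L and a proper non-empty subset H,
    # form (L - H) => H with confidence supp(L) / supp(L - H).
    rules = []
    for large, supp_l in support.items():
        for size in range(1, len(large)):
            for head in combinations(large, size):
                body = tuple(i for i in large if i not in head)
                confidence = supp_l / support[body]
                if confidence >= min_confidence:
                    rules.append((body, head, supp_l, confidence))
    return rules


if __name__ == "__main__":
    demo = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
    for rule in mine(demo, min_support=0.5, min_confidence=0.6):
        print(rule)
```

The lookup `support[body]` relies on the fact that every subset of a large itemset is itself large, and therefore survives both passes.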
4 Experimental Results

The experimental results in this section are aimed at exploring the trade-off between system performance and flexible data analysis that we faced in the development of our prototype. For the experiments, we generated two synthetic databases, db1 and db2, using the publicly available datagen generator [4]. Database db1 contains groups and each group has an average of 5 tuples (i.e., each customer's purchase averages 5 products); this yields a total database size of tuples. The chosen data distribution yields rules with an average length of 2 elements. Database db2 has again groups, but the average number of tuples in each group is 10. Hence, the size of db2 is twice the size of db1. Furthermore, in this case, the selected data distribution yields an average rule length of 4 elements. The experiments were performed on a (non-dedicated) Digital Alpha Server with 256 MB of RAM and the Oracle 8.0 ROLAP server.

The first experiment, whose result is presented in Figure 5(a), compares the performance of our architecture with that of the traditional flat file approach. We extracted simple association rules with minimum support 0.75% from both the data stored in binary files and the data stored in the ROLAP server. The flat file approach, which is specifically designed for the discovery of association rules with a fixed structure and with very simple and unchangeable extraction criteria, yields clearly better performance (faster by a factor of roughly 8 for both data distributions). This result is obtained without performing specific optimizations of the database I/O (e.g., data is read tuple by tuple and not using arrays), in order to show the worst-case difference between the two approaches. Optimized database access operations would bring the performance of the two approaches closer. Unfortunately, the superior performance of the flat file approach is obtained at the price of completely losing the possibility of flexibly varying the rule extraction criteria.
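The remark on database I/O (reading tuple by tuple rather than using arrays) can be illustrated with a small Python sketch. It uses the standard-library `sqlite3` module purely as a stand-in; it is not the Oracle 8.0 interface used in the experiments, and the table name `source`, its columns `gid` and `item`, and the helper `make_demo_db` are hypothetical. An in-memory SQLite database will not reproduce the timing gap, since there are no client/server round trips; the sketch only shows the two access patterns.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory table of (group id, item) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source (gid INTEGER, item TEXT)")
    conn.executemany(
        "INSERT INTO source VALUES (?, ?)",
        [(1, "a"), (1, "b"), (2, "a"), (2, "c")],
    )
    return conn


def load_groups_row_by_row(conn):
    """Unoptimized pattern: one fetch call per tuple."""
    cur = conn.execute("SELECT gid, item FROM source")
    groups = {}
    while True:
        row = cur.fetchone()
        if row is None:
            break
        groups.setdefault(row[0], set()).add(row[1])
    return groups


def load_groups_batched(conn, batch_size=1000):
    """Array-style pattern: each fetch call returns a batch of tuples."""
    cur = conn.execute("SELECT gid, item FROM source")
    cur.arraysize = batch_size  # default batch size used by fetchmany()
    groups = {}
    while True:
        rows = cur.fetchmany()
        if not rows:
            break
        for gid, item in rows:
            groups.setdefault(gid, set()).add(item)
    return groups
```

Both functions build the same mapping from group identifier to item set; only the number of fetch calls differs, which is where a client/server database would save round trips.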
The second experiment shows the effect of adding a mining condition (namely, BODY.price < HEAD.price) to the previous extraction criteria. In this case, we compared both the performance and the selectivity of the above complex statement with respect to the simple statement considered in the previous experiment. For the experiment, whose result is presented in Figure 5(b), we considered database db1. The addition of a more selective extraction criterion (the mining condition reduces the allowed product pairs) significantly reduces the number of obtained rules (from 54 for the simple statement to 7 for the complex statement), thereby improving the significance of the extracted information. Furthermore, the preprocessing step performed by the ROLAP server significantly reduces the time spent by the algorithm in building the rules; hence, the obtained performance improves significantly (by a factor larger than 25). In particular, while in the former experiment the work performed by the preprocessor was very small and its effect on the overall performance was negligible, in this case most of the time is spent in the preprocessing step, which prepares the elementary large rules for the extraction algorithm. Hence, the preprocessing step, although it takes longer than before, can contribute significantly to the improvement of the overall performance of the architecture. Finally, note that the overall processing time for the complex statement is also lower (roughly by a factor of 3) than the time required by flat file processing of the simple statement.

[Figure 5: Experimental results: (a) Comparison of the flat file approach with the AMORE-DW prototype approach (db1 and db2), (b) Comparison of the execution of a simple statement with that of a complex statement (with mining condition) in the AMORE-DW architecture (db1). Legend: preprocessing step on DW; algorithm step on DW; flat file approach.]

Footnote 4: Elementary rules represent large itemsets of cardinality two, i.e., the rules with the lowest cardinality for body and head.

5 Conclusions

In this paper we described the architecture of the AMORE-DW system. AMORE-DW provides an environment for the specification of mining requests by means of a powerful SQL-like language for expressing rule extraction criteria.
The MINE RULE operator has already been used for the specification of several mining problems applied to heterogeneous domains, e.g., the analysis of telephone data. The architecture of the AMORE-DW environment is characterized by a tight coupling with a ROLAP server, which manages the access to the warehouse data from which rules are extracted. The flexibility gained by accessing data stored in relational format is counterbalanced by the reduced performance of the system with respect to the traditional flat file approach. Nevertheless, when more selective extraction conditions are specified, a significant improvement of the overall processing time is obtained, owing to the effective preprocessing of data performed by the ROLAP server. Finally, we observe that a tight coupling of mining and warehousing technology yields several opportunities that we are currently exploring:

• Selected relevant association rule sets can be precomputed and stored in the warehouse to guarantee optimal response time. These precomputed rule sets can be incrementally updated upon periodical update of the warehouse data.
• Extraction statements can be progressively refined, e.g., by providing more restrictive selection conditions; the system can store and exploit the result of previous processing steps to simplify the successive extraction of rules selected by the refined statements.
• A collection of different extraction algorithms (e.g., [4, 10]), each one appropriate for a different type of data distribution, can be easily incorporated in the system and used for different warehouse data distributions.

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD Conference on Management of Data, pages 207-216, Washington, D.C., May British Columbia.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Knowledge Discovery in Databases, volume 2. AAAI/MIT Press, Santiago, Chile, September
[3] R. Agrawal and K. Shim. Developing tightly-coupled data mining applications on a relational database system. KDD-96, pages 287-290,
[4] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th VLDB Conference, Santiago, Chile, September
[5] R. Agrawal and R. Srikant. Mining sequential patterns. In International Conference on Data Engineering, Taipei, Taiwan, March
[6] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1):65-74, March
[7] J. Han, S. H. Chee, and J. Y. Chiang. Issues for on-line analytical mining of data warehouses. In Proceedings of SIGMOD-98 Workshop on Research Issues on Data Mining and Knowledge Discovery, Seattle, Washington, USA, June
[8] J. Han, Y. Fu, W. Wang, K. Koperski, and O. Zaiane. DMQL: A data mining query language for relational databases. In Proceedings of SIGMOD-96 Workshop on Research Issues on Data Mining and Knowledge Discovery,
[9] R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. In Proceedings of the 22nd VLDB Conference, Bombay, India, September
[10] J. S. Park, M. Shen, and P. S. Yu. An effective hash based algorithm for mining association rules. In Proceedings of the ACM-SIGMOD International Conference on the Management of Data, San Jose, California, May
[11] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland,
[12] R. Srikant and R. Agrawal. Mining generalized association rules. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland, September
