Data Mining for Numeric Data


Takeshi Fukuda

June 14, 2000


Contents

1 Introduction
   1.1 Contributions
      1.1.1 Numeric Association Rules
      1.1.2 Parallel Processing of Aggregate Queries
      1.1.3 Decision Trees
      1.1.4 Experiments
   1.2 Thesis Outline

2 One-Dimensional Rules
   2.1 Introduction
      2.1.1 Association Rules
      2.1.2 Optimized Association Rules
      2.1.3 Main Results
      2.1.4 Extensions of Optimized Association Rules
      2.1.5 Related Work
   2.2 Preliminaries
      2.2.1 Association Rules
      2.2.2 Optimized Association Rules
      2.2.3 Buckets
   2.3 Making Equi-depth Buckets
      2.3.1 Algorithm
      2.3.2 Sample Size
      2.3.3 Parallel Bucketing
      2.3.4 Number of Buckets
   2.4 Algorithms
      2.4.1 Optimized Confidence Rules
      2.4.2 Optimized Support Rules
      2.4.3 Optimized Gain Rules
      2.4.4 Generalization of Optimized Rules
   2.5 Optimized Ranges for Average Operator
   2.6 Performance Results
      2.6.1 Making Buckets
      2.6.2 Finding Optimized Rules
   2.7 Conclusions

3 Two-Dimensional Rules
   Introduction (Two-Dimensional Association Rules; Main Results; Visualization; Related Work)
   Preliminaries (Pixel Grid and Regions; Two-Dimensional Association Rules)
   Optimized Rectangular Regions
   X-monotone and Rectilinear Convex Regions (Computing Optimized-Gain x-monotone Regions; Computing Optimized-Gain Rectilinear Convex Regions; Computing Optimized-Confidence and -Support Regions Exactly; Approximating Optimized-Confidence and -Support Regions)
   Visualization
   Experimental Results (Performance; Overfitting)
   Conclusions

4 Parallel Aggregation
   Introduction (Motivating Example)
   Algorithms for Single Queries (Two-Phase Algorithm (2P); Repartitioning Algorithm (Rep); Broadcasting Algorithm (BC))
   Algorithms for Multiple Queries (Two-Phase Algorithm (m2p); Repartitioning Algorithm (mrep); Broadcasting Algorithm (mbc); Number of Scans)
   Analytical Evaluation (Grouping Selectivity; Number of Queries; Network Speed; Speedup and Scaleup; Switching Points)
   Empirical Evaluation (Grouping Selectivity; Switching Points; Speedup and Scaleup)
   Conclusions

5 Decision Tree
   Introduction (Decision Trees; Handling Numeric Attributes; Main Results)
   Entropy-Based Data Splitting (Entropy of a Splitting Region; Splitting; Selecting Correlated Attributes)
   Algorithms (Naive Hand-Probing Algorithm; Guided Branch-and-Bound Search)
   Performance (Computing Optimal Regions; Computing Decision Trees)
   Application to Credit Risk Analysis
   Conclusions

6 Summary and Future Work
   Thesis Summary (One-Dimensional Rules; Two-Dimensional Rules; Parallel Aggregation; Decision Trees)
   Future Work

Acknowledgements

Bibliography

List of Figures

1.1 KDD process
2.1 Sample size and probability of the error being over 50%
2.2 Approximation by buckets
2.3 The inner tangent of Q_m and U_{r(m)}
2.4 Upper hulls
2.5 Computing upper hulls
2.6 Leaving L untouched
2.7 Clockwise search
2.8 Counter-clockwise search
2.9 Performance of bucketing algorithms
2.10 Finding optimized confidence rules
2.11 Finding optimized support rules
3.1 Rectangular, x-monotone, and rectilinear convex regions
3.2 Visualization system
3.3 F(i, m+1) and the associated region
3.4 Partition of a rectilinear region into monotone parts
3.5 Convexity assumption
3.6 Optimized-support range and its approximation by focused ranges
3.7 A bad example
3.8 Finding an optimized-confidence rectangle
3.9 Finding an optimized-support rectangle
3.10 Number of focused x-monotone regions
3.11 Stamp points of focused regions
3.12 Finding an optimized-gain x-monotone region
3.13 Finding a focused optimized-confidence x-monotone region
3.14 Finding a focused optimized-support x-monotone region
3.15 Finding a focused optimized-support rectilinear convex region
3.16 Finding a focused optimized-support rectilinear convex region
3.17 Results for D_linear
3.18 Results for D_circular
4.1 Example of a data-cube lattice
4.2 Example rule
4.3 Selectivity vs. performance (Q=100)
4.4 Selectivity vs. performance (Q=1)
4.5 Proposed vs. conventional
4.6 No. of queries vs. performance
4.7 Network speed vs. performance
4.8 Speedup and scaleup
4.9 Switching points
4.10 Selectivity vs. performance (Q=100)
4.11 Selectivity vs. performance (Q=1)
4.12 Proposed vs. conventional
4.13 Switching points
4.14 Speedup and scaleup
5.1 Decision tree
5.2 Healthy region and guillotine-cut subdivision
5.3 x-monotone region splitting
5.4 Hand-probe
5.5 CPU time for computing the optimized region
5.6 Rules for analyzing credit risk
5.7 Decision tree for analyzing credit risk

List of Tables

1.1 Complexities for computing optimized regions
2.1 Error range of approximation depending on the number of buckets
2.2 Effective indices
2.3 Top indices
4.1 Parameters for the cost model
5.1 Health check records
5.2 Performance in computing the optimal region
5.3 Tree construction time (1)
5.4 Tree construction time (2)
5.5 Financial statements of companies
5.6 Splitting by region


Chapter 1

Introduction

In the activities of modern business organizations, massive amounts of data are input into computer systems and processed. For example, banks conduct and keep logs on most of their transactions by means of computers. At a cash register in a supermarket, labels on all the items bought by customers are read by a bar-code reader, and the data are processed by a computer to calculate the total amount of sales and to control inventory and stocks. If you buy something with a credit card, a computer processes and records the transaction. In such ways, huge amounts of data are generated and gathered every day. This trend has been accelerated by the recent spread of data input devices such as bar-code readers, credit cards, OCRs, and ATM terminals. Most such data are discarded after being processed. However, since a collection of such data captures information about the characteristics of products, customers, and business competitors, it is worth keeping the data stored in a database if it is capable of yielding higher-level information (knowledge) useful for decision support, exploration, and better understanding of the phenomena generating the data. In fact, business organizations in real life have huge amounts of such data and are interested in extracting from them unknown information that inspires new marketing strategies. Traditionally, the task of data analysis has been performed by analysts, who familiarize themselves with the data, create summaries, and generate reports by using statistical techniques. To reduce the burden of this manual

data analysis, a variety of research has been done on online analytical processing (OLAP) and data cubes [CCS93, GBLP95, HRU96, JS96, GHRU96, GHQ95, SAG96, AAD+96, AGS96, SR95], and a number of statistical databases and multi-dimensional databases have been put on the market. However, such a verification-driven approach is only feasible when the volume and dimensionality of the data are quite low. Real-life data sets may contain millions of records, each with hundreds of attributes, and no one can begin to understand such vast quantities of high-dimensional data. Hence there is a growing need for a discovery-driven approach to aid in (at least partial) automation of analysis tasks. Database systems should be the primary tools for this task, but unfortunately today's database systems offer little functionality to support such knowledge discovery applications. At the same time, machine learning techniques usually perform poorly when applied to very large data sets. Thus, data mining (also known as knowledge discovery in databases, or KDD), whose aim is to discover useful knowledge in very large databases, has attracted considerable research interest [PS91, MAR96, AGI+92, AIS93a, AIS93b, AS94, SA95, BFOS84, HCC92, NH94, PCY95, PS91, PSF91, Qui93, SAD+93, AHPR95, ZRL96].

KDD is a new interdisciplinary field that merges ideas from databases, statistics, pattern recognition, neural networks, machine learning, and parallel computing. The name refers to the overall process of discovering new patterns for building models in a given dataset. There are many steps involved in the KDD process, of which the principal ones are: (1) data selection, (2) data cleaning and preprocessing, (3) data transformation and reduction, (4) data mining task and algorithm selection, and finally (5) post-processing and interpretation of discovered knowledge [FPSS96]. Figure 1.1 illustrates these KDD steps. Typical discovery-driven tasks include:

Association Rules: Given a database of transactions, where each transaction consists of a set of items, association discovery finds all the item sets that frequently occur together, and also the rules governing their relations [AIS93b]. Efficient algorithms are reported in [AS94, PCY95, Bay98].

[Figure 1.1: KDD process. Raw Data → Target Data → Preprocessed Data → Transformed Data → Rules/Patterns → Knowledge, through selection, preprocessing, transformation, data mining, and interpretation/evaluation.]

Sequential Patterns: Sequential pattern discovery is a variant of association rule discovery whose aim is to extract sequences of events that commonly take place over a period of time [AS95].

Classification and Regression: Classification aims to assign a new data item to one of several pre-defined categorical classes [WK91]. While classification predicts the value of a categorical attribute, regression is applied to cases in which the attribute being predicted has a real-valued domain. This task is also referred to as supervised learning, since it uses a training sample of previously solved cases in which the values of both the inputs and the corresponding outputs have been recorded.

Clustering: Clustering aims to partition a database into subsets or clusters such that elements in a cluster look alike and elements in different clusters do not. This task is also known as unsupervised learning.

Similarity Search: Similarity search aims to find objects that are within a user-defined distance from a queried object, or to find all the pairs within some distance of each other. This kind of search is

especially applicable to temporal and spatial databases.

1.1 Contributions

1.1.1 Numeric Association Rules

Conventional association rule discovery algorithms consider only the interrelation of Boolean attributes. In addition to Boolean attributes, however, databases in the real world usually have numeric attributes. Thus, it is also an important issue to find association rules for numeric attributes. In this thesis, we consider the problem of finding association rules for numeric data. Since the domains of numeric attributes are often very large, there are potentially many conditions on them. An association rule with numeric attributes is interesting only if the rule has some special feature. We introduce the notion of optimized association rules with respect to the following three criteria:

1. confidence,
2. support, and
3. gain (a linear combination of support and hit).

For one-dimensional association rules of the form (A ∈ I) ⇒ C, where A is a numeric attribute, I is a range of A, and C is the target condition of interest, we present sophisticated algorithms that compute an optimal range with respect to each of those criteria in linear time. We extend the optimized numeric association rules to two-dimensional ones of the form ((A, B) ∈ R) ⇒ C, where A and B are numeric attributes, R is a pixel region, and C is the target condition. We show that the problem of finding the optimal region is NP-hard if we consider arbitrary connected regions. We introduce the following three basic and handy classes of regions:

1. rectangular regions,
2. x-monotone regions, and
3. rectilinear convex regions.

We present algorithms for computing an optimized-gain, -confidence, or -support rectangular region in O(n^1.5) time, where n is the number of pixels, by transforming the problem into the computation of optimized ranges. We also present an asymptotically optimal (i.e., O(n)-time) algorithm for computing an optimized-gain x-monotone region, which uses sophisticated computational geometry techniques and dynamic programming. In addition, we present an efficient algorithm for computing an optimized-gain rectilinear convex region in O(n^1.5) time, and we prove that no algorithm can compute an optimized {-confidence, -support} {x-monotone, rectilinear convex} region in a time that is polynomial with respect to n and log M unless P = NP, where M is the total number of tuples. To overcome this difficulty, we present an efficient algorithm for approximating the optimized-confidence and -support regions through the use of optimized-gain regions. Table 1.1 summarizes the complexities of our algorithms.

    Region Class         Gain          Confidence / Support
    Rectangular          O(n^1.5)      O(n^1.5)
    x-monotone           O(n)          O(n log M) (approximation)
    Rectilinear convex   O(n^1.5)      O(n^1.5 log M) (approximation)

    Table 1.1: Complexities for computing optimized regions

We also examine the characteristics of the three classes of regions with regard to the prediction of unseen data. We show that the prediction accuracy of rectilinear convex regions is better and more stable than that of x-monotone regions. On the other hand, it is more expensive to compute optimized rectilinear convex regions than x-monotone regions; therefore both functions should be available so that the user can flexibly choose whichever is better suited to the application and data.

1.1.2 Parallel Processing of Aggregate Queries

In order to efficiently discover optimized numeric association rules, we need to discretize the domains of numeric attributes. This process can be a bottleneck in the overall mining process, and therefore we present a randomized algorithm that computes the discretization of a numeric attribute in O(N log M) time, where N is the number of tuples and M is the resolution of the discretization. We generalize this problem to the computation of multiple aggregate queries that often appear in OLAP systems, and present parallel algorithms that scale with the volume of input and the number of processors on shared-nothing multiprocessors.

1.1.3 Decision Trees

Although we have focused on optimized rules with one or two dimensions (i.e., conditional attributes) and presented efficient solutions, a database usually has many attributes. Indeed, we need an automatic decision system to provide users with decision-making advice for analyzing a large-scale database in which many attributes have a combined effect on the target property. Although many solutions have been proposed, conventional methods are inefficient at handling numeric attributes when some of those attributes are strongly correlated. To solve this problem, we present an entropy-based greedy method of constructing decision trees by using region splittings of the optimized numeric association rules. Our algorithm can compute an optimized-entropy x-monotone (resp. rectilinear convex) region in O(nM) (resp. O(n^1.5 M)) time in the worst case, and O(n log M) (resp. O(n^1.5 log M)) time in practical cases, where n is the number of pixels and M is the number of tuples. Although region splitting requires additional computation time proportional to the number of pairs of numeric conditional attributes, in many applications the improvement is worth the computational cost, provided there are not too many numeric attributes. We apply the proposed method to credit risk analysis, to confirm that our method is useful in practice.
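As a point of reference for the entropy-based criterion (the precise splitting entropy used by our method is defined in Chapter 5; the following is only the standard two-class entropy), a node containing n tuples, n_1 of which meet the target condition and n_2 = n − n_1 of which do not, is evaluated by

    Ent(n_1, n_2) = −(n_1/n) log2(n_1/n) − (n_2/n) log2(n_2/n),

and a region splitting is chosen greedily so as to minimize the weighted sum of the entropies of the two sides. For instance, with hypothetical numbers, splitting 100 tuples containing 50 hits (entropy 1.0) into a region with 40 hits out of 50 tuples and a complement with 10 hits out of 50 tuples reduces the weighted entropy to 0.5 · Ent(40, 10) + 0.5 · Ent(10, 40) ≈ 0.72.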

1.1.4 Experiments

We show through experiments that our algorithms are fast and effective not only in theory but also in practice. The efficiency of our algorithms enables us to compute optimized rules for all combinations of hundreds of numeric and Boolean attributes in a reasonable time. We also confirm that the visualization of discovered rules is quite a useful and important way for users to gain insight from a large volume of numeric data.

1.2 Thesis Outline

We begin by introducing the concept of optimized numeric association rules and presenting sophisticated algorithms for the one-dimensional case in Chapter 2. In Chapter 3, we extend the concept to the two-dimensional case, discuss the hardness of the problem, and give efficient solutions. In both chapters, we also give the results of experimental studies, which show the practical effectiveness of the algorithms. To efficiently discover numeric association rules, we discretize the numeric attributes. Chapter 4 generalizes this step to the processing of multiple aggregate queries and gives scalable solutions on shared-nothing multiprocessors. In Chapter 5, we present an efficient way of constructing decision trees by using one-dimensional and two-dimensional optimized numeric association rules. Finally, Chapter 6 summarizes the main contributions of this thesis and suggests directions for future work.


Chapter 2

One-Dimensional Rules

2.1 Introduction

2.1.1 Association Rules

Given a database (a universal relation), we consider the association rule that if a tuple meets a condition C_1, then it also satisfies another condition C_2 with a certain probability (called a confidence). We will denote such an association rule (or rule, for short) between the presumptive condition C_1 and the objective condition C_2 by C_1 ⇒ C_2. (We use the symbol ⇒ in order to distinguish the relationship from logical implication, which is usually denoted by →.) We call a rule exact if its confidence is 100%. For the purpose of discovering exact or almost exact rules, Piatetsky-Shapiro [PS91] presents the KID3 algorithm. Important rules in scientific databases are likely to be exact. On the other hand, business databases, such as customer databases and transaction databases, tend to reflect the uncontrolled real world, and the confidence of an interesting rule is usually much less than 100%. Thus, for commercial databases, we should consider a broader class of rules whose confidences are greater than a specified minimum threshold, such as 30%. We call such rules confident. Agrawal, Imielinski, and Swami [AIS93b] studied ways of discovering all confident rules. They focused on rules with conditions that are conjunctions of (A = yes), where A is a

Boolean attribute, and present an efficient algorithm. They applied the algorithm to basket-data-type retail transactions to derive interesting associations between items, such as

    (Pizza = yes) ∧ (Coke = yes) ⇒ (Potato = yes).

Improved versions of the algorithm have also been reported [AS94, PCY95].

2.1.2 Optimized Association Rules

Conventional association rule discovery algorithms consider only the interrelation of Boolean attributes. In addition to Boolean attributes, however, databases in the real world usually have numeric attributes, such as age and balance of account in databases of bank customers. Thus, finding association rules for numeric attributes is also an important issue. In this chapter, we focus on finding a simple rule of the form

    (Balance ∈ [v_1, v_2]) ⇒ (CardLoan = yes),

which states that customers whose balances fall in the range between v_1 and v_2 are likely to take out credit card loans. If an instance of the range is given, the confidence of this rule can be computed with ease. In practice, however, we want to find a range that yields a confident rule. Such a range is called a confident range. Unfortunately, a confident range is not always unique, and, for instance, we may find a confident range that contains only a very small number of customers. Let the support of a range be the ratio of the number of tuples in the range to the number of all tuples. A range is called ample if its support is no less than a given fixed threshold. We want to find a rule associated with a range that is both ample and confident. In particular, we would like to find the confident range with maximum support, and we call the associated rule an optimized support rule. This range captures the largest cluster of customers that are likely to take out credit card loans with a probability no less than the given minimum confidence threshold. Here, we refer to a data set associated with a range of the numeric attribute value as a cluster, for short.

Instead of the optimized support rule, it is also interesting to find the ample range that maximizes the confidence. We call the associated rule an optimized confidence rule. This range gives us a cluster of more than, for instance, 10% of the customers that tend to use card loans with the highest confidence factor. If we want to promote credit card loans by sending direct mail to a fixed number of new customers within a limited budget, this rule gives us useful information on the target customers.

2.1.3 Main Results

There are trivial ways of computing optimized support rules and optimized confidence rules in O(N^2) time, where N is the number of all tuples. In this chapter, we give a non-trivial linear-time algorithm for each optimized rule, on the assumption that the data are sorted with respect to the numeric attribute. Each algorithm uses some computational geometry techniques to achieve linear time complexity. Given the sorted data, our algorithms are asymptotically optimal. Sorting the database, however, could create a serious problem if the database is much larger than the main memory, because sorting the data for each numeric attribute would take an enormous amount of time. To handle giant databases that cannot fit in the main memory, we need to find another way of computing optimized rules. For this purpose, we present algorithms for approximating optimized rules, using randomized algorithms [MR95]. The essence of these algorithms is that we generate thousands of almost equi-depth buckets (buckets are called (almost) equi-depth if the tuples are (almost) uniformly distributed over the buckets), and then combine some of those buckets to create approximately optimized ranges. In order to obtain such almost equi-depth buckets, we first create a sample of the data that fits into the main memory, thus ensuring the efficiency with which the sample is sorted. Then, we sort the sample and divide it into equi-depth buckets. We show that the buckets in the sample also give almost equi-depth buckets in the original data with high probability. Tests show that our implementation is fast not only in theory but also

in practice. Even for a small case, our algorithm is faster than a naive quadratic-time algorithm by an order of magnitude. The efficiency of our algorithm makes it possible to compute a complete set of optimized rules for all combinations of hundreds of numeric and Boolean attributes in a reasonable time. We present some performance results.

2.1.4 Extensions of Optimized Association Rules

We have been focusing on rather simple rules of the form (A ∈ [v_1, v_2]) ⇒ C, but our algorithms can be straightforwardly extended to generate rules of the form (A ∈ [v_1, v_2]) ∧ C_1 ⇒ C_2, where C_1 and C_2 are Boolean statements that do not contain any uninstantiated ranges on numeric attributes.

Another interesting application of the algorithms for computing optimized association rules is to efficiently generate a range in a numeric attribute that maximizes the average of the values in another attribute. For example, bankers are interested in customers whose saving account balances are very high. They would therefore like to know the range I of ages that maximizes the average saving account balance of the customers in I, under the condition that I contains an ample number (no less than a given threshold) of customers. We will discuss this problem in Section 2.5.

It would also be valuable to extend our framework to rules with two numeric attributes in the presumptive condition, and to find the region in the two-dimensional space of these attributes that represents a nice association rule between these two numeric attributes and the conclusion. For instance, we would like to find a rule such as ((Age, Balance) ∈ X) ⇒ (CardLoan = yes), where X is a rectangle or a connected region in the two-dimensional space of Age and Balance. Optimized rules can also be naturally defined in this extension. While the problem of finding the optimal arbitrary connected region is NP-hard, we present practical solutions in Chapter 3 for the cases where the regions are rectangular, x-monotone, and rectilinear convex.

2.1.5 Related Work

Some other work has been done on handling numeric attributes. Piatetsky-Shapiro [PS91] studied how to sort the values of a numeric attribute, divide the sorted values into approximately equi-depth ranges, and use only those fixed ranges to derive exact rules. Ranges other than the fixed ones are not considered in his framework. Our method is not only capable of outputting optimized ranges, but is also more convenient than Piatetsky-Shapiro's method, since we need not make candidate ranges beforehand. Recently, Srikant and Agrawal [SA96] have improved Piatetsky-Shapiro's method by considering combinations of consecutive ranges. A combined range could be the whole range of the numeric attribute, which produces a trivial rule. To avoid this, Srikant and Agrawal present an efficient way of computing a combined range whose size is at most a threshold given by the user. Although Srikant and Agrawal's approach does not output optimized ranges, it can generate not only ranges but also interesting rectangular regions and hypercubes.

The association rule is a fundamental tool for constructing efficient decision trees. If we consider a decision tree on a database containing numeric attributes, we need a method for computing good association rules. ID3 [Qui93], CART [BFOS84], CDP [AIS93b], and SLIQ [MAR96] perform binary partitioning of numeric attributes repeatedly until each range contains data of one specific group (or several groups, in some cases) with high probability, while IC [AGI+92] uses k decomposition. Our optimized association rule is a powerful substitute for those known methods, and Chapter 5 gives a method of constructing efficient decision trees.

2.2 Preliminaries

2.2.1 Association Rules

Definition 2.1 Let R be a relation. To describe conditions on tuples in R, we use primitive conditions. For a Boolean attribute A, A = yes and A = no are primitive conditions. For a numeric attribute A, A = v and A ∈ [v_1, v_2] are primitive conditions. Let t be a tuple in R, and let t[A] denote t's

value for the attribute A. t meets A = v if t[A] is equal to v. t meets A ∈ [v_1, v_2] if t[A] belongs to the interval [v_1, v_2]. In order to describe more complicated conditions, we also use conjunctions of primitive conditions.

Example 2.1 Consider a relation for retail transactions. Each attribute of the relation is a Boolean one whose domain is {yes, no}, and represents an item, such as Coke or Pizza. (Coke = yes) ∧ (Pizza = yes) is a condition, and a tuple that meets the condition represents a customer who purchased a coke and a pizza.

Example 2.2 Consider a relation for data on a bank's customers. Suppose that each tuple contains the balance of account and services (card loan or automatic withdrawal, say) for one customer. (Balance ∈ [15821, 26264]) ∧ (CardLoan = yes) is an example of a condition.

Definition 2.2 The support of a condition C is defined as the percentage of tuples that meet condition C, and is denoted by support(C).

Example 2.3 For instance, in the retail relation given in Example 2.1, if support(Coke = yes) = 10%, then 10% of the customers purchase a coke.

Definition 2.3 Let C_1 and C_2 be conditions on tuples. An association rule (or rule for short) has the form C_1 ⇒ C_2. The confidence of the rule C_1 ⇒ C_2 is defined as support(C_1 ∧ C_2)/support(C_1), which will be denoted by conf(C_1 ⇒ C_2).

Example 2.4 For instance, in the bank relation in Example 2.2, suppose that the confidence of the rule (Balance ∈ [15821, 26264]) ⇒ (CardLoan = yes) is 50%; then 50% of the customers whose balances fall in the range use credit card loans.
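As a concrete illustration of Definitions 2.2 and 2.3, the following C++ sketch (the struct Customer and the field names Balance and CardLoan are hypothetical; the actual interfaces of the implementation in Section 2.6 differ) computes the support and confidence of a rule of the form (Balance ∈ [lo, hi]) ⇒ (CardLoan = yes) with a single scan:

    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Customer { double balance; bool cardLoan; };

    // Returns {support, confidence} of (Balance in [lo, hi]) => (CardLoan = yes):
    //   support    = |{t : t.balance in [lo, hi]}| / N
    //   confidence = |{t : t.balance in [lo, hi] and t.cardLoan}| / |{t : t.balance in [lo, hi]}|
    std::pair<double, double> supportConfidence(const std::vector<Customer>& tuples,
                                                double lo, double hi) {
        std::size_t inRange = 0, hits = 0;
        for (const Customer& t : tuples) {
            if (lo <= t.balance && t.balance <= hi) {
                ++inRange;
                if (t.cardLoan) ++hits;
            }
        }
        double support    = tuples.empty() ? 0.0 : double(inRange) / tuples.size();
        double confidence = (inRange == 0) ? 0.0 : double(hits) / inRange;
        return {support, confidence};
    }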

2.2.2 Optimized Association Rules

Throughout this chapter, we focus on mining association rules of the form (A ∈ [v_1, v_2]) ⇒ C. Suppose that A and C are fixed.

Definition 2.4 A rule is confident if its confidence is not less than the given minimum confidence threshold. Among confident rules, an optimized support rule maximizes support(A ∈ [v_1, v_2]).

Definition 2.5 A rule is ample if support(A ∈ [v_1, v_2]) is not less than the given minimum support threshold. Among ample rules, an optimized confidence rule maximizes the confidence.

Example 2.5 Consider rules of the form (Balance ∈ [v_1, v_2]) ⇒ (CardLoan = yes). Suppose that 50% is given as the minimum confidence threshold. We may have many instances of ranges that yield confident rules, such as:

    Range        [1000, 10000]   [5000, 5500]   [500, 7000]
    Support      20%             2%             15%
    Confidence   50%             55%            52%

Among those ranges, [1000, 10000] is a candidate range for an optimized support rule. Next, given 10% as the minimum support threshold, we may also have many ample rules, such as:

    Range        [1000, 5000]    [2000, 4000]   [3000, 8000]
    Support      13%             10%            11%
    Confidence   65%             50%            52%

The reader might find it strange that although [1000, 5000] is a superset of [2000, 4000], the confidence of the rule for the former range is greater than that for the latter range, but observation will confirm that such situations can really occur.
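To see how a superset can have a higher confidence, consider hypothetical counts (not taken from the tables above): if [2000, 4000] contains 200 tuples of which 100 take card loans, its confidence is 100/200 = 50%; if the extra tuples in [1000, 2000) ∪ (4000, 5000] number 100 and 80 of them take card loans, then [1000, 5000] contains 300 tuples and 180 hits, so its confidence is 180/300 = 60%, which is higher even though the range is larger.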

2.2.3 Buckets

Definition 2.6 Let t be a tuple of the given relation R, and let t[A] denote the value of the attribute A of t. Buckets of the domain of A are a sequence of disjoint ranges B_1, B_2, ..., B_M (B_i = [x_i, y_i] and x_i ≤ y_i < x_{i+1}) such that the values of A for all tuples are covered by the buckets; namely, for an arbitrary tuple t ∈ R, there exists a bucket B_j that contains t[A]. We say that a bucket B_i is finest if B_i = [x, x] for a value x.

Example 2.6 If A represents age and the domain of A is the non-negative integers bounded by 120, we can make 121 finest buckets [i, i] for each i = 0, 1, ..., 120. When A shows the balances of millions of customers in a bank, the domain of A may range from $0 up to a very large balance; in this case the number of finest buckets may amount to millions.

Linking consecutive buckets B_s, B_{s+1}, ..., B_t creates a range [x_s, y_t]. Observe that if all buckets are finest, a combination of consecutive finest buckets gives the range of an optimized association rule. Given a large number (thousands, say) of buckets that may not be finest, an approximation of the range of an optimized rule can be obtained by joining consecutive buckets. Thus, as ranges of rules, we only use those that consist of consecutive buckets.

Definition 2.7 We call the number of tuples in {t ∈ R | t[A] ∈ B_i} the support of B_i, and denote it by u_i. We assume that each bucket B_i contains at least one tuple; that is, u_i ≥ 1. The B_i's are called equi-depth if the size of every B_i is the same. Let v_i denote the number of tuples in {t ∈ R | t[A] ∈ B_i, t meets C}, and let N be the number of all tuples. (Σ_{i=s}^{t} v_i)/(Σ_{i=s}^{t} u_i) gives the confidence of the rule (A ∈ [x_s, y_t]) ⇒ C, and the support of A ∈ [x_s, y_t] is (Σ_{i=s}^{t} u_i)/N.

To compute u_i and v_i, for each tuple t, we need to determine the bucket that t[A] belongs to. One natural way of doing this is to scan each tuple once and locate the bucket to which the tuple belongs by using a hash function or

by building an ordered binary tree of finest buckets. This technique works fast for a huge database if the number of finest buckets is small (recall the case of age in Example 2.6), but it may run very slowly if it has to handle millions of finest buckets, owing to the limited size of the main memory. Another natural method is to sort the given relation over A and divide the sorted data into finest buckets, but it takes an enormous amount of time to sort a giant database that is much larger than the main memory.

The above discussion shows that the most difficult case is that in which the number of finest buckets is large and the size of the given database is huge. One such example may be the balances of millions of customers in a bank. In this case, for the sake of efficiency, we should avoid sorting a huge database and reduce the number of buckets to be considered. Thus our approach is to generate a small number (say thousands) of buckets, which may not be finest, instead of making millions of finest buckets. We will make almost equi-depth buckets so that we can make good approximations of optimized rules. In the next section we present a way of making almost equi-depth buckets without sorting the data.

2.3 Making Equi-depth Buckets

We present a way of dividing N data items into M buckets almost evenly.

2.3.1 Algorithm

Since we must avoid sorting the data, as mentioned in the previous section, we use the following approximation algorithm:

Algorithm 2.1

1. Make an S-sized random sample from the N data items.

2. Sort the sample in O(S log S) time.

3. Scan the sorted sample and set the (iS/M)-th smallest sample value to p_i for each i = 1, ..., M−1. Let p_0 be −∞ and p_M be +∞.

4. For each tuple x in the original N data items, find i such that p_{i−1} < x ≤ p_i and assign x to the i-th bucket. This check can be done in O(log M) time by using a binary search tree for the buckets. Thus, for all i, the size u_i of B_i can be computed in O(N log M) time.

The complexity of this algorithm is O(max(S log S, N log M)). In practice, S ≪ N, and hence the complexity is O(N log M). We evaluated the performance of this algorithm with very large sets of data (containing up to ten million tuples) and found that the computation time grows almost linearly in proportion to the data size. See Subsection 2.6.1.

2.3.2 Sample Size

How many samples are enough to generate almost equi-depth buckets? Let S be the sample size and I be an interval that contains N/M original data items. Let X denote the number of sample points that belong to I. Since we pick each sample point independently and uniformly at random with replacement from the original data, the probability that a sample point belongs to I is 1/M. Hence, X follows a binomial distribution, B(S, 1/M), and therefore we can compute the following probability for δ and M by using the tail probability of the binomial distribution:

    p_e = Pr(|X − S/M| ≥ δ·S/M).

Note that p_e does not depend on the number of tuples, N. Figure 2.1 shows the relationship between p_e and S/M for δ = 0.5 and M ∈ {5, 10, 10000}. For every value of M, p_e goes down sharply when S/M < 40. It becomes smaller than 0.3% when S/M = 40, and it does not decrease much when S/M > 40. Thus, in our implementation we use 40M as S. In practice, a value of at most M = 10^4 is precise enough to allow us to derive approximate rules, and a (40 × 10^4)-sized sample fits into the main memory. Subsection 2.6.1 presents performance results for Algorithm 2.1 and a naive method using Quick Sort.
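In the spirit of the C++ implementation mentioned in Section 2.6, the following is a minimal sketch of Algorithm 2.1; it assumes, for brevity, that the data fit in a std::vector and are non-empty, whereas in our setting Step 4 would scan the relation on secondary storage:

    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <vector>

    // Approximate equi-depth bucketing (Algorithm 2.1): pick boundaries from a random
    // sample, then count bucket sizes u_i with one scan of the full data.
    std::vector<std::size_t> equiDepthCounts(const std::vector<double>& data,
                                             std::size_t M, std::mt19937& rng) {
        std::size_t S = 40 * M;                        // sample size suggested in Subsection 2.3.2
        std::vector<double> sample(S);
        std::uniform_int_distribution<std::size_t> pick(0, data.size() - 1);
        for (double& x : sample) x = data[pick(rng)];  // Step 1: sample with replacement
        std::sort(sample.begin(), sample.end());       // Step 2: sort the sample
        std::vector<double> p(M - 1);                  // Step 3: p_i = (iS/M)-th smallest sample
        for (std::size_t i = 1; i < M; ++i) p[i - 1] = sample[i * S / M - 1];
        std::vector<std::size_t> count(M, 0);
        for (double x : data) {                        // Step 4: locate the bucket of each value
            std::size_t b = std::upper_bound(p.begin(), p.end(), x) - p.begin();
            ++count[b];
        }
        return count;
    }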

[Figure 2.1: Sample size and probability of the error being over 50%. The probability p_e is plotted against the number of sample points per bucket (S/M) for M = 5, M = 10, and M = 10,000.]

2.3.3 Parallel Bucketing

The most time-consuming part of Algorithm 2.1 is Step 4, which scans the entire database to find buckets for all tuples. Because we want to know only the size of each bucket, we can easily perform Step 4 in parallel:

Algorithm 2.2

1. Randomly distribute the tuples in the database to the processor elements (PEs) almost evenly.

2. At a coordinating PE, execute Steps 1, 2, and 3 of Algorithm 2.1.

3. At each PE, perform Step 4 of Algorithm 2.1; namely, scan the divided data and count the number of data items in each bucket.

4. Gather the results from all PEs to the coordinating PE and compute their sum.

No communication is necessary during the counting process, and hence we expect that this algorithm will be scalable with the number of PEs. Chapter 4 presents a more detailed study of parallel bucketing algorithms.
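Algorithm 2.2 targets a shared-nothing multiprocessor; purely as an illustration of its structure (local counting followed by a final sum, with no communication in between), here is a sketch using threads on a single shared-memory machine, with hypothetical helper names:

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Each "PE" counts its share of the data against the common boundaries p
    // (the M-1 values p_1, ..., p_{M-1} of Algorithm 2.1); the local histograms
    // are summed at the coordinating side at the end.
    std::vector<std::size_t> parallelCounts(const std::vector<double>& data,
                                            const std::vector<double>& p,
                                            unsigned pes) {
        std::size_t M = p.size() + 1;
        std::vector<std::vector<std::size_t>> local(pes, std::vector<std::size_t>(M, 0));
        std::vector<std::thread> workers;
        std::size_t chunk = (data.size() + pes - 1) / pes;
        for (unsigned w = 0; w < pes; ++w) {
            workers.emplace_back([&, w] {
                std::size_t lo = w * chunk, hi = std::min(data.size(), lo + chunk);
                for (std::size_t i = lo; i < hi; ++i) {
                    std::size_t b = std::upper_bound(p.begin(), p.end(), data[i]) - p.begin();
                    ++local[w][b];
                }
            });
        }
        for (std::thread& t : workers) t.join();
        std::vector<std::size_t> total(M, 0);
        for (unsigned w = 0; w < pes; ++w)
            for (std::size_t b = 0; b < M; ++b) total[b] += local[w][b];
        return total;
    }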

[Figure 2.2: Approximation by buckets. The optimal range, with support support_opt, is replaced by one of the possible approximations, with support support_app; each endpoint moves by at most one bucket of support 1/M.]

2.3.4 Number of Buckets

If all buckets are finest, every possible range can be expressed by connecting some subsequence of the buckets. Otherwise, the optimal range that we want to compute may not be generated from the buckets. Let us consider the possible error caused by the granularity of buckets. Assume that we have M equi-depth buckets. (Using equi-depth buckets minimizes the possible error of the approximations for any fixed number of buckets, since any other bucketing method will produce a bucket larger than 1/M.) As shown in Figure 2.2, the optimal range will be replaced by one of four possible approximate ranges. Let support_opt and conf_opt be the support and confidence of the optimal range, and let support_app and conf_app be those of an approximate range. Since the support of each bucket is 1/M, we can bound the error of an approximation as follows:

    |support_app − support_opt| / support_opt ≤ 2 / (M · support_opt),
    (conf_opt − conf_app) / conf_opt ≤ 2 / (M · support_opt + 2).

For example, when the support of the optimal range is 30% and its confidence is 70%, Table 2.1 shows how the support and confidence of an approximate range depend on the number of buckets. Observe that the approximation may contain significant error when the number of buckets is small. To

make the error negligible, the number of buckets should be much larger than 1/support_opt.

    No. of buckets   support_app       conf_app
    10               10.0% – 50.0%     ≥ 42.0%
    50               26.0% – 34.0%     ≥ 59.2%
    100              28.0% – 32.0%     ≥ 65.6%
    500              29.6% – 30.4%     ≥ 69.1%
    1,000            29.8% – 30.2%     ≥ 69.5%

    Note: support_opt = 30% and conf_opt = 70%.

    Table 2.1: Error range of approximation depending on the number of buckets

2.4 Algorithms

As explained in Subsections 2.2.3 and 2.3.4, if all buckets are finest or if plenty of buckets are given, we can focus on rules whose ranges are combinations of consecutive buckets, and therefore we give algorithms for computing optimized rules among such rules. Precisely, given a sequence of buckets B_1, B_2, ..., B_M such that B_i = [x_i, y_i] and x_i ≤ y_i < x_{i+1}, we focus on rules of the form (A ∈ [x_s, y_t]) ⇒ C, where [x_s, y_t] is a combination of consecutive buckets B_s, B_{s+1}, ..., B_t.

Definition 2.8 Since any range [x_s, y_t] is specified by a pair of indexes s ≤ t, for simplicity we denote support(A ∈ [x_s, y_t]) by support(s, t) and denote conf((A ∈ [x_s, y_t]) ⇒ C) by conf(s, t) throughout this section. Then, among rules such that support(s, t) is not less than a given threshold, an optimized confidence rule maximizes conf(s, t). On the other hand, among rules such that conf(s, t) is not less than a given threshold, an optimized support rule maximizes support(s, t).
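Before the linear-time algorithms, it is worth recording the obvious baseline. The following C++ sketch is the quadratic-time naive method referred to in Subsections 2.1.3 and 2.6.2, written here over the M buckets (the struct and function names are ours, not part of the implementation described in Section 2.6): it simply examines every pair (s, t).

    #include <cstddef>
    #include <vector>

    struct Best { std::size_t s, t; double value; };

    // Naive O(M^2) baseline: u[i] and v[i] are the per-bucket counts of Definition 2.7
    // (0-based). Finds the best conf(s,t) among ample pairs and the best support(s,t)
    // among confident pairs in one double loop over all pairs.
    void naiveOptimizedPairs(const std::vector<double>& u, const std::vector<double>& v,
                             double N, double minSupport, double minConf,
                             Best& bestConf, Best& bestSupport) {
        bestConf = {0, 0, -1.0};
        bestSupport = {0, 0, -1.0};
        for (std::size_t s = 0; s < u.size(); ++s) {
            double us = 0.0, vs = 0.0;
            for (std::size_t t = s; t < u.size(); ++t) {
                us += u[t]; vs += v[t];
                double support = us / N, conf = vs / us;
                if (support >= minSupport && conf > bestConf.value) bestConf = {s, t, conf};
                if (conf >= minConf && support > bestSupport.value) bestSupport = {s, t, support};
            }
        }
    }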

2.4.1 Optimized Confidence Rules

Definition 2.9 Let B_1, ..., B_M be buckets. Let N denote the number of all tuples. Let u_i denote the number of tuples in {t ∈ R | t[A] ∈ B_i}. We assume that u_i ≥ 1. Let v_i denote a real number associated with B_i. Consider the sequence of points Q_k = (Σ_{i=1}^{k} u_i, Σ_{i=1}^{k} v_i) for k = 1, ..., M, and let Q_0 be (0, 0). Let m and n be non-negative integers such that m < n. Observe that the x-coordinate of Q_n minus the x-coordinate of Q_m is equal to N · support(m+1, n). We call s ≤ t an ample pair if support(s, t), which is Σ_{i=s}^{t} u_i / N, is no less than the given minimum support threshold. We call m and n an optimal slope pair if m+1 and n are an ample pair that maximizes the slope of Q_m Q_n. If more than one pair has the same maximum slope, select a pair that maximizes support(m+1, n).

In the special case when v_i is the number of tuples in {t ∈ R | t[A] ∈ B_i, t meets C}, the slope of the line Q_m Q_n gives conf(m+1, n). Thus, if m and n are an optimal slope pair, (A ∈ [x_{m+1}, y_n]) ⇒ C is an optimized confidence rule. We will therefore present an algorithm for computing an optimal slope pair. To compute an optimal slope pair, we use a technique of handling convex hulls, for which we introduce some special terms.

Definition 2.10 Let S be a set of distinct points. A convex polygon of S has the property that any line segment connecting two points of S must lie entirely inside the polygon. The convex hull of S is the smallest convex polygon of S. Let Q_min be the node in S with the minimum x-coordinate, and let Q_max be the node in S with the maximum x-coordinate. Observe that Q_min and Q_max are on the convex hull of S. From Q_min we can visit nodes on the convex hull of S in clockwise (counter-clockwise) order until we hit Q_max, and we call the set of nodes visited the upper (lower) hull of S.

Let U_m denote the upper hull of {Q_m, ..., Q_M}, and let r(m) be min{i | (m+1, i) is an ample pair}. Now consider the tangent of Q_m and U_{r(m)}, and suppose that the tangent touches U_{r(m)} at Q_t, as illustrated in Figure 2.3. Q_t is called the terminating

point of the tangent (if the tangent touches more than one node of U_{r(m)}, select the node with the maximum x-coordinate as Q_t). It is easy to see that if m and n are an optimal slope pair, Q_n is the terminating point of the tangent of Q_m and U_{r(m)}. Thus, we need to find the tangent of Q_m and U_{r(m)} with the maximum slope among all m. To this end, we will present an algorithm whose computational complexity is linear in the number of buckets.

[Figure 2.3: The inner tangent of Q_m and U_{r(m)}, touching U_{r(m)} at the terminating point Q_t to the right of Q_{r(m)}.]

Online Maintenance of Convex Hulls

We present Algorithm 2.3, which constructs a data structure that represents the convex hull tree of Q_0, ..., Q_M, such as the one illustrated in Figure 2.4. While various implementations are possible [PS85], we use stacks S and D_i (i = 0, ..., M) in the data structure. We use S to store the sequence of nodes of the convex hull that we are focusing on, and we use D_i to store a branch of the convex hull tree, namely the nodes that belong to U_{i+1} but do not belong to U_i. Algorithm 2.3 consists of a preparatory phase and a restoration phase. Given a sequence of nodes Q_0, ..., Q_M that is sorted with respect to the x-coordinate value, the preparatory phase sets each branch of the convex hull tree to D_i, for i = M−1, ..., 0; the restoration phase makes U_{r(m)} on S, for each m = 0, ..., M−1, using the D_i's.
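Before stating Algorithm 2.3 in full, the following C++ fragment sketches the underlying stack discipline in isolation: scanning Q_M, ..., Q_0 from right to left and popping while the slope condition of the clockwise search holds yields the upper hull. This is only a warm-up under the assumption that the x-coordinates are strictly increasing, which holds here because every bucket contains at least one tuple; Algorithm 2.3 additionally records the popped nodes in the branches D_i so that the intermediate hulls U_i can be restored later.

    #include <cstddef>
    #include <vector>

    struct Pt { double x, y; };   // Q_k = (sum of u_i, sum of v_i) for i <= k

    double slope(const Pt& a, const Pt& b) { return (b.y - a.y) / (b.x - a.x); }

    // Builds the upper hull U_0 of Q_0, ..., Q_M by scanning from right to left.
    // After the loop, reading the vector from back to front gives U_0 from Q_0
    // (leftmost) to Q_M (rightmost), i.e., in clockwise order.
    std::vector<Pt> upperHull(const std::vector<Pt>& Q) {
        std::vector<Pt> S;        // top of the stack = back of the vector
        for (std::size_t k = Q.size(); k-- > 0; ) {
            while (S.size() >= 2 &&
                   slope(Q[k], S[S.size() - 1]) <= slope(Q[k], S[S.size() - 2]))
                S.pop_back();     // the old top is no longer on the upper hull
            S.push_back(Q[k]);
        }
        return S;
    }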

Algorithm 2.3 Suppose that we are given a sequence of nodes Q_0, Q_1, ..., Q_M that is sorted with respect to the x-coordinate value. Let S and D_i (i = 1, ..., M) be empty.

Preparatory Phase: For each i = M, ..., 0, we perform the following step so that, after the execution of each step, the top-to-bottom order of nodes in S corresponds to the clockwise order of nodes on U_i (the upper hull of {Q_i, ..., Q_M}), which enables us to access the neighbors of each node Q on U_i by looking at the next node and the previous node of Q in S. Initially, when i = M, push Q_M onto S, which trivially makes S store U_M. Otherwise, visit each node Q on U_{i+1} from Q_{i+1} in clockwise order (visit each node in S in top-to-bottom order), and find the line Q_i Q with the maximum slope. This search is done by performing the following procedure:

Clockwise Search: If the slope of Q_i and the top node of S is less than or equal to the slope of Q_i and the node that is second from the top of S, the top node is no longer a node on U_i; in this case, pop the top node from S, push it onto D_i, and repeat the check. Otherwise, the slope of Q_i and the top node is maximum; in this case, push Q_i onto S.

The D_i's are used for recording the nodes deleted at each step. Since at most M−1 nodes are popped from S in the above check, the check is executed at most 2(M−1) times, and hence the time and space complexities of the clockwise search are O(M). After the termination, S stores U_0.

Restoration Phase: Assume that S stores U_0. Make S contain U_1 by popping the top node Q_0 from S and pushing back all nodes of D_0 onto S. Set 1 to i. We use i to search for r(m) for each m = 0, ..., M−1. Now for each m = 0, ..., M−1, perform the following procedure so that S stores U_{r(m)}: While (m+1, i) is not an ample pair, pop the top node Q_i from S, push back all nodes of D_i in top-to-bottom order onto S, which

makes S store U_{i+1}, and increment i. If i > M, we stop the restoration phase.

Observe that, after the execution of the above while statement, i contains r(m), and hence S stores U_{r(m)}. Since, throughout the execution, at most M nodes are popped from S and at most M nodes are pushed back from the D_i's onto S, the overall computation time is O(M).

[Figure 2.4: Upper hulls of the nodes Q_0, ..., Q_9; the dotted line from Q_i to Q_9 shows the upper hull of {Q_i, ..., Q_9}.]

Example 2.7 Consider the nodes Q_0, Q_1, ..., Q_9 in Figure 2.4. The dotted line from Q_i to Q_9 shows the upper hull of {Q_i, ..., Q_9}. Let us apply the preparatory phase of Algorithm 2.3 to {Q_0, ..., Q_9}. Each column of the upper table in Figure 2.5 illustrates the content of S for each i = 9, ..., 0. Observe that each column contains the upper hull of {Q_i, ..., Q_9}. Each column of the lower table in Figure 2.5 shows D_i for i = 9, ..., 0. We can also see how the restoration phase works by observing the columns from i = 0 to 9.

Computing Tangents

Now we are in a position to present the algorithm for computing the tangent of Q_m and U_{r(m)} with the maximum slope among all m.

[Figure 2.5: Computing upper hulls. The upper table shows the contents of the stack S after each step i = 9, ..., 0 of the preparatory phase (each column is the upper hull of {Q_i, ..., Q_9}); the lower table shows the branches D_9, ..., D_0 holding the nodes popped at each step.]

Algorithm 2.4 Given Q_0, Q_1, ..., Q_M, perform the preparatory phase of Algorithm 2.3, so that in what follows we can obtain U_{r(m)} for each m = 0, ..., M by executing the restoration phase of Algorithm 2.3. We use the variable L to store the tangent of Q_m and U_{r(m)} for some m.

Base Step: Compute U_{r(0)} by using the restoration phase of Algorithm 2.3. Find the terminating point of the tangent of Q_0 and U_{r(0)} by clockwise search; that is, visit each node Q on U_{r(0)} from Q_{r(0)} in clockwise order, find the line Q_0 Q with the maximum slope, and set the tangent to L.

Inductive Step: For each m = 1, ..., M, while U_{r(m)} is not empty, compute U_{r(m)} by using the restoration phase of Algorithm 2.3, assume that L stores the tangent of Q_k and U_{r(k)} for some k < m, and perform one of the following steps:

If Q_m is above or on L, the slope of the tangent of Q_m and U_{r(m)} is not greater than that of L. Figure 2.6 illustrates an example of this case. We do not compute the tangent of Q_m and U_{r(m)}, and leave L untouched.

Otherwise, let Q_t be the terminating point of L, compute the tangent of Q_m and U_{r(m)} by one of the following steps, and set the tangent to L:

[Figure 2.6: Leaving L untouched: Q_m lies above or on L, the tangent of Q_k and U_{r(k)}.]

[Figure 2.7: Clockwise search for the terminating point of Q_m and U_{r(m)}, starting from Q_{r(m)}.]

If L does not touch U_{r(m)} (Q_t is on the left-hand side of Q_{r(m)}), find the terminating point of Q_m and U_{r(m)} by clockwise search; that is, visit each node Q on U_{r(m)} from Q_{r(m)} in clockwise order, and find the line Q_m Q with the maximum slope. Figure 2.7 illustrates this case. Among all nodes on both U_{r(k)} and U_{r(m)}, let X denote the node with the minimum x-coordinate. The above clockwise search only scans edges from Q_{r(m)} to at most X; otherwise, U_{r(k)} could not be convex, since the terminating point is above Q_m X, and at least one node on the left-hand side of X on U_{r(k)} is also above or on Q_m X. Since the edges between Q_{r(m)} and X are hidden inside U_{r(k)}, those edges have never been scanned in this algorithm.

Otherwise, L touches U_{r(m)} at Q_t. Find the terminating point of Q_m and U_{r(m)} by counter-clockwise search; that is, visit each node Q on U_{r(m)} from Q_t in counter-clockwise order, and find the line Q_m Q with the maximum slope. Figure 2.8 illustrates this case. Note that the edges between Q_{r(m)} and Q_t have never been scanned before in this algorithm.

[Figure 2.8: Counter-clockwise search for the terminating point of Q_m and U_{r(m)}, starting from Q_t.]

Final Step: Among all tangents that have been set to L, select the ones with the maximum slope.

Theorem 2.1 Algorithm 2.4 computes all optimal slope pairs in O(M) time.

Proof Let S denote the set of edges on U_{r(m)} for all m. Both the clockwise search and the counter-clockwise search in the algorithm scan each edge in S at most once. Since the number of edges in S is at most M−1, the algorithm computes the tangents with the maximum slope in O(M) time.

2.4.2 Optimized Support Rules

Definition 2.11 Let B_1, ..., B_M be buckets such that B_i = [x_i, y_i] and x_i ≤ y_i < x_{i+1}. Let N denote the number of all tuples. Let u_i denote the number of tuples in {t ∈ R | t[A] ∈ B_i}. Let v_i denote a real number associated with B_i. Let s and t be non-negative integers such that s ≤ t. Let avg(s, t) denote (Σ_{i=s}^{t} v_i)/(Σ_{i=s}^{t} u_i). Let θ be a minimum threshold for avg(s, t). We call (s, t) an optimal support pair if avg(s, t) ≥ θ and (s, t) maximizes support(s, t) (= (Σ_{i=s}^{t} u_i)/N).

In the special case when v_i is the number of tuples in {t ∈ R | t[A] ∈ B_i, t meets C}, avg(s, t) is equal to conf(s, t). Suppose that θ is the minimum confidence threshold. If (s, t) is an optimal support pair, (A ∈ [x_s, y_t]) ⇒ C is an optimized support rule. We will now present an algorithm for computing an optimal support pair.

Definition 2.12 Let us call s effective if avg(j, s−1) < θ for every j < s.

Lemma 2.1 If (s, t) is an optimal support pair, s is effective.

Proof Otherwise, there exists j such that avg(j, s−1) ≥ θ. Since (s, t) is an optimal support pair, avg(s, t) ≥ θ, and hence avg(j, t) ≥ θ, which contradicts the optimality of s and t (see Table 2.2).
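As a small illustration with hypothetical numbers, take θ = 50% and four buckets with (u_i, v_i) = (10, 6), (10, 2), (10, 7), (10, 1). Index s = 1 is effective by definition. Index s = 2 is not effective, because avg(1, 1) = 0.6 ≥ θ. Index s = 3 is effective, because avg(1, 2) = 8/20 = 0.4 < θ and avg(2, 2) = 0.2 < θ. Index s = 4 is not effective, because avg(1, 3) = 15/30 = 0.5 ≥ θ. By Lemma 2.1, only s ∈ {1, 3} can start an optimal support pair.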

    B_j ... B_{s-1}        B_s ... B_t
    conf(j, s-1) ≥ θ       conf(s, t) ≥ θ
              conf(j, t) ≥ θ

    Table 2.2: Effective indices

From the above lemma, we will find all effective indices and choose an optimal support pair. Let w be max_{j<s} Σ_{i=j}^{s-1} (v_i − θ·u_i) for each index s. Then, note that s is effective if and only if w < 0. The following algorithm computes w for all indices in O(M) time by scanning the buckets forwards, and gives the set of all effective indices.

Algorithm 2.5

    1 is effective; w := 0;
    for each s := 2 to M begin
        w := v_{s-1} − θ·u_{s-1} + max{0, w};
        if (w < 0) then s is effective
    end.

Definition 2.13 Let top(s) denote the largest index t such that s ≤ t and avg(s, t) ≥ θ. The final step is to choose a value of s that maximizes Σ_{i=s}^{top(s)} u_i.

Lemma 2.2 If s < s′ are effective, top(s) ≤ top(s′).

Proof Since s′ is effective, avg(s, s′−1) < θ. From the definition of top(s), avg(s, top(s)) ≥ θ. Then it follows that avg(s′, top(s)) ≥ θ, which implies that top(s) ≤ top(s′) (see Table 2.3).

Thanks to this property, we only need to scan backwards through the list of effective indices (s(1), ..., s(q)) and the list of all indices (1, ..., M) alternately to find top(s(i)). We can do this by means of the following algorithm:

    B_s ... B_{s'-1}        B_{s'} ... B_{top(s)}
    conf(s, s'-1) < θ       conf(s', top(s)) ≥ θ
              conf(s, top(s)) ≥ θ

    Table 2.3: Top indices

Algorithm 2.6

    i := M;
    for each j from q to 1 begin
        while (avg(s(j), i) < θ) do begin
            i := i − 1;
        end;
        top(s(j)) := i
    end.

In the above algorithm, we pre-compute a cumulative table F(j) = Σ_{i=1}^{j} v_i − θ·Σ_{i=1}^{j} u_i. Since avg(s(j), i) < θ iff F(i) − F(s(j)−1) < 0, we can check avg(s(j), i) < θ in constant time (F(0) is defined as 0). Thus both Algorithms 2.5 and 2.6 run in O(M) time, and we have:

Theorem 2.2 All optimal support pairs can be computed in O(M) time.

2.4.3 Optimized Gain Rules

In the algorithm literature, Bentley [Ben84] introduced a linear-time algorithm (Kadane's algorithm), which unfortunately does not work for finding the optimized confidence and support rules. Kadane's algorithm computes a range I = [s, t] that maximizes Σ_{i=s}^{t} x_i against an array of real numbers x_i. If every x_i is non-negative, trivially [1, M] gives the solution, so we assume that some elements are negative. Let a(j) denote max{Σ_{i=s}^{t} x_i | s ≤ t ≤ j};

then the interval of a(M) is our answer. To compute a(j) we introduce another auxiliary data item b(j) that denotes max{Σ_{i=s}^{j} x_i | s ≤ j}. Then, the following relations hold:

    b(j+1) = max{0, b(j)} + x_{j+1},
    a(j+1) = max{b(j+1), a(j)}.

Set b(0) to 0. Then, simple dynamic programming gives a linear-time solution.

Definition 2.14 For a confidence threshold θ, call Σ_{i=s}^{t} (v_i − θ·u_i) the gain of (s, t) and denote it by gain(s, t). We call s and t an optimal gain pair if (s, t) maximizes gain(s, t). A rule that corresponds to an index pair (s, t) is an optimized gain rule if (s, t) is an optimal gain pair.

If v_i − θ·u_i is set to x_i, Kadane's algorithm computes the optimal gain pair. Therefore we have:

Theorem 2.3 All optimal gain pairs can be computed in O(N) time.

Note that the range I of the optimized gain rule is not equivalent to the range of the optimized support rule, since there may be a larger confident range I′ ⊇ I.

2.4.4 Generalization of Optimized Rules

Our algorithms can be straightforwardly extended to compute rules of the general form (A ∈ [v_1, v_2]) ∧ C_1 ⇒ C_2, where C_1 and C_2 are Boolean statements that do not contain any variables on numeric attributes. Let u_i be the size of {t ∈ R | t[A] ∈ B_i, t meets C_1}, let v_i be the size of {t ∈ R | t[A] ∈ B_i, t meets C_1 and C_2}, and apply our algorithms to this case.
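To make the dynamic programming of Subsection 2.4.3 concrete, the following C++ sketch applies the recurrences above to x_i = v_i − θ·u_i over the buckets (a sketch only; it assumes the bucket counts are already in memory and that there is at least one bucket, and it additionally tracks where the best run starts so that the optimal gain pair itself is returned):

    #include <cstddef>
    #include <vector>

    struct GainPair { std::size_t s, t; double gain; };   // 0-based bucket indices

    // Kadane-style scan: b is the best sum of a run ending at the current bucket,
    // and best records the maximum gain seen so far together with its index pair.
    GainPair optimizedGainPair(const std::vector<double>& u,
                               const std::vector<double>& v, double theta) {
        GainPair best{0, 0, v[0] - theta * u[0]};
        double b = 0.0;
        std::size_t runStart = 0;
        for (std::size_t j = 0; j < u.size(); ++j) {
            double x = v[j] - theta * u[j];
            if (b > 0) { b += x; } else { b = x; runStart = j; }
            if (b > best.gain) best = {runStart, j, b};
        }
        return best;
    }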

2.5 Optimized Ranges for Average Operator

In this section, we present an application of the algorithm for computing optimal slope pairs and the algorithm for finding optimal support pairs, which are given in Section 2.4. In decision-support-type query processing, range queries are often accompanied by aggregates. For instance, bankers want to characterize excellent customers whose saving account balances are relatively high. In order to characterize such customers, the database user may guess that the range from 1,000 to 3,000 is promising and issue the query:

    SELECT avg(SavingAccount)
    FROM BankCustomers
    WHERE 1000 < CheckingAccount AND CheckingAccount < 3000;

To discover a satisfactory range with a high average, however, the user might have to guess and generate many queries for various ranges. Instead, we want to obtain the optimized range that maximizes the average of SavingAccount among all ranges in CheckingAccount that contain an ample percentage of customers, say no less than 10%. In what follows, we formalize this problem and present an efficient way of computing optimized ranges by using the algorithms for generating optimized association rules.

Definition 2.15 Let R be a relation, and let A denote a numeric attribute of R. Suppose that [v_1, v_2] is a range in A. The support of the range [v_1, v_2], denoted by support([v_1, v_2]), is defined as the percentage of tuples for which the values of A fall within the range. Let B be another numeric attribute to which the average operator is applied. We call B the target numeric attribute. The sum_B of the range [v_1, v_2], denoted by sum_B([v_1, v_2]), is defined as the summation of the values of B over all tuples in which the value of A falls within [v_1, v_2]. Let N denote the number of tuples in R. The avg_B of the range [v_1, v_2], denoted by avg_B([v_1, v_2]), is sum_B([v_1, v_2])/(N · support([v_1, v_2])).

avg_SavingAccount([v_1, v_2]) is the average of the saving account balances of the tuples whose checking account balances are in [v_1, v_2].

There is a trade-off between maximizing support(I) and maximizing avg_B(I) for a range I. We therefore give a minimum threshold for either support(I) or avg_B(I) and compute the range that maximizes the value of the other.

Definition 2.16 Suppose that a minimum threshold is given for the support of a range. Among ranges that meet this constraint, the maximum average range I maximizes avg_B(I).

Example 2.9 Consider our running example. Suppose that 10% is given as the minimum threshold for the support of a range. The maximum average range I maximizes avg_SavingAccount(I) among all ranges in CheckingAccount whose support is no less than 10%.

Definition 2.17 Suppose that we are given a minimum average threshold for avg_B(I) of a range I that is greater than the average of all data. Among all ranges that meet this constraint, the maximum support range I maximizes support(I). If the threshold is no greater than the average of all data, it is trivial that the longest range, namely the whole domain of A, gives the maximum support range.

Example 2.10 In our running database BankCustomers, bankers may want to use 10,000, which is greater than the average of all data, as the minimum threshold for the average of saving account balances, and they want to have the maximum support range in CheckingAccount that maximizes the number of customers whose checking account balances are in the range.

As in the case of computing the optimized confidence/support rules, we divide the domain of A into buckets B_1, ..., B_M, and we only consider ranges that are combinations of consecutive buckets B_s, ..., B_t. Let B_i be [x_i, y_i] such that x_i ≤ y_i < x_{i+1}. Let u_i be the number of tuples in {t ∈ R | t[A] ∈ B_i}. Let support(s, t) be Σ_{i=s}^{t} u_i. Let v_i denote Σ_{t ∈ R, t[A] ∈ B_i} t[B]. Let avg(s, t) denote (Σ_{i=s}^{t} v_i)/(Σ_{i=s}^{t} u_i).
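The reduction just described can be sketched in a few lines of C++ (the struct Row and the function name are hypothetical): per bucket of the conditional attribute A we count tuples into u_i and sum the target attribute B into v_i, after which Algorithm 2.4 (for the maximum average range) and Algorithm 2.6 (for the maximum support range) can be applied to u and v exactly as for optimized association rules.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Row { double a, b; };   // e.g., (CheckingAccount, SavingAccount)

    // Fills u[i] = number of tuples whose A-value falls in bucket i and
    // v[i] = sum of their B-values, given the M-1 bucket boundaries p on A.
    void bucketAggregates(const std::vector<Row>& rows, const std::vector<double>& p,
                          std::vector<double>& u, std::vector<double>& v) {
        std::size_t M = p.size() + 1;
        u.assign(M, 0.0);
        v.assign(M, 0.0);
        for (const Row& r : rows) {
            std::size_t i = std::upper_bound(p.begin(), p.end(), r.a) - p.begin();
            u[i] += 1.0;
            v[i] += r.b;
        }
    }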

Given a minimum support threshold for support(s, t), Algorithm 2.4 computes an optimal slope pair (s, t) in O(M) time, and therefore [x_s, y_t] is the maximum average range, since we focus on ranges formed by consecutive buckets. Also, given a minimum average threshold for avg(s, t), Algorithm 2.6 generates an optimal support pair (s, t) in O(M) time, and hence [x_s, y_t] is the maximum support range.

2.6 Performance Results

The proposed algorithms have been implemented in C++ as functions of our Database SONAR (System for Optimized Numeric Association Rules) prototype. We evaluated their performance on an IBM Power Series 850 with a 133-MHz PowerPC 604 and 96 MB of main memory, running AIX.

2.6.1 Making Buckets

We randomly generated test data with 8 numeric attributes and 8 Boolean attributes, that is, with 72 bytes per tuple. The test data resided in the AIX file system on an IDE drive. As a test case, we divided the test data into 1,000 buckets with respect to each numeric attribute and counted the number of tuples in every bucket for each Boolean attribute.

We compared the performance of our bucketing algorithm, Algorithm 2.1, with those of two other methods. One of these methods, which we call Naive Sort, sorts the data for each numeric attribute by using Quick Sort. The other, which we call Vertical Split Sort, first splits the data vertically to generate a smaller table containing a tuple identifier and each numeric attribute, and then sorts the temporary table.

Figure 2.9 shows the execution times for various numbers of tuples. For large data sets with more than one million tuples, Algorithm 2.1 outperforms the naive method by more than an order of magnitude, and it also beats Vertical Split Sort by a factor of 2 to 4.

Furthermore, the execution time of Algorithm 2.1 grows almost linearly in proportion to the data size.

Figure 2.9: Performance of bucketing algorithms (elapsed time vs. number of tuples for Algorithm 2.1, Vertical Split Sort, and Naive Sort)

2.6.2 Finding Optimized Rules

We evaluated the performance of the algorithms for finding optimized rules, comparing the results with those of a naive method that computes, in quadratic time, the confidence and support of all ranges in order to find an optimal one. Figures 2.10 and 2.11 respectively show the execution times for finding optimized confidence rules with a 5% support threshold and optimized support rules with a 50% confidence threshold, for numbers of buckets ranging from 100 to 1,000,000.

Our algorithm for finding optimized confidence rules beats the naive method by more than an order of magnitude for more than 500 buckets. For finding optimized support rules, our algorithm is also faster than the naive method by more than an order of magnitude for data sets of more than 100 buckets. Even for small data sets, both of our algorithms outperform the naive ones. The execution time of both algorithms increases linearly in proportion to the number of buckets.

In the case of finding optimized confidence rules with a minimum support of 5%, the execution time begins to increase rapidly when the number of buckets exceeds 800,000 (the second graph of Figure 2.10). This is because the algorithm uses more memory than the test machine has: we observed that the operating system reported frequent page faults, caused by the extensive memory accesses required for the small minimum support.

2.7 Conclusions

In this chapter we have considered the problem of finding numeric association rules of the form (A ∈ I) ⇒ C, where A is a numeric attribute, I is a range of A, and C is a target condition of interest. As there may be many confident and ample ranges, we introduced three criteria for the optimality of rules: (1) optimized confidence, (2) optimized support, and (3) optimized gain. We presented efficient algorithms for computing the ranges of optimized confidence and optimized support rules, and showed that the above three types of optimized numeric association rules can be computed in linear time if the input data are sorted. In order to handle a large volume of data efficiently, we presented a randomized algorithm that generates almost equi-depth buckets of numeric data in O(N log M) time, where N is the number of tuples and M is the number of buckets. Experiments show that our implementation is fast not only in theory but also in practice. The efficiency of our algorithms enables us to compute optimized rules for all combinations of hundreds of numeric and Boolean attributes in a reasonable time.

Figure 2.10: Finding optimized confidence rules (elapsed time vs. number of buckets; the first graph compares Algorithm 2.3 with the naive method at min sup = 5%, and the second shows Algorithm 2.3 at min sup = 5%, 50%, and 90%)

Figure 2.11: Finding optimized support rules (elapsed time vs. number of buckets; the first graph compares our algorithm with the naive method at min conf = 50%, and the second shows our algorithm at min conf = 10%, 50%, and 90%)


Chapter 3

Two-Dimensional Rules

3.1 Introduction

3.1.1 Two-Dimensional Association Rules

In Chapter 2, we considered association rules between a numeric attribute and a Boolean attribute. In the real world, such binary associations between two attributes are usually not enough to describe the characteristics of a data set, and therefore we often want to find a rule involving more than two attributes. The main aim of this chapter is to generate rules (which we call two-dimensional association rules) that represent the dependence of the probability that an objective condition (corresponding to a Boolean attribute) will be met on a pair of numeric attributes.

For each tuple t, let t[A] and t[B] be its values for two numeric attributes A and B; for example, t[A] = Age of a customer t and t[B] = Balance of t. Then, t is mapped to a point (t[A], t[B]) in the Euclidean plane E^2. For a region R in E^2, we say a tuple t meets the condition (Age, Balance) ∈ R if t is mapped to a point in R. We want to find a rule of the form

((A, B) ∈ R) ⇒ C,

such as ((Age, Balance) ∈ R) ⇒ (CardLoan = yes).

In practice, we consider a huge database containing millions of tuples, and hence we have to handle millions of points, which may occupy much more space than the main memory.

To avoid dealing with such a large number of points, we discretize the problem; that is, we distribute the values of each numeric attribute into N equi-depth buckets. We divide the Euclidean plane into N × N pixels (unit squares), and we map each tuple t to the pixel containing the point (t[A], t[B]). We use a union of pixels as the region of a two-dimensional association rule. The probability that the tuples in a pixel (respectively, a region) satisfy the objective condition C is called the confidence of the pixel (region). We would like to find a confident region, whose confidence is above some threshold.

Let M denote the total number of records in the given database. We expect that the average number of records mapped to one pixel will be neither too small nor too large. In practice we assume that M^{1/3} ≤ N ≤ M^{1/2}, thereby ensuring that the average number of records mapped to one pixel ranges from 1 to M^{1/3}. Confident regions, ample regions, and optimized-confidence and -support regions can be defined naturally, as in the case of one-dimensional association rules.

The shape of the region R is important for obtaining a good association rule. For instance, if we gather all the pixels whose confidence is above some threshold, and define R to be the union of these pixels, then R is a confident region with (usually) a high support. A query system of this type is proposed by Keim et al. [KKS94]. However, such a region R may consist of many connected components, creating an association rule that is very difficult to characterize, and whose validity is hence hard to see. Therefore, in order to obtain a rule that can be stated briefly or characterized through visualization, R should belong to a class of regions with nice geometric properties.

We consider three classes of geometric regions: (axis-aligned) rectangular regions, x-monotone regions, and rectilinear convex regions. A rectangular region is useful, as it is very simple and intuitive. However, it is often too restrictive; for example, it is impossible for axis-aligned rectangles to capture data distributed along a diagonal line. On the other hand, if we consider arbitrary connected regions, the problem of finding the optimal region becomes intractable. Therefore we need a handy class of regions that is more general (flexible) than rectangular regions and more specific (restrictive) than arbitrary connected regions.

We consider two basic classes of grid regions, x-monotone regions and rectilinear convex regions, as instances of such classes. A connected region is called x-monotone if its intersection with any vertical line is undivided. A connected region is called rectilinear convex if both its intersection with any vertical line and its intersection with any horizontal line are always undivided; in other words, a region is rectilinear convex if it is both x-monotone and y-monotone. Figure 3.1 shows examples of rectangular, x-monotone, and rectilinear convex regions.

Figure 3.1: Rectangular, x-monotone, and rectilinear convex regions

3.1.2 Main Results

To efficiently compute x-monotone regions, we generalize an algorithm introduced by Asano et al. [ACKT96], which segments an object image from the background of a gray image, to our color-image case, and obtain a linear-time algorithm for computing the optimized-gain x-monotone region, which maximizes the gain (see Section 3.2 for a definition). Although this algorithm uses sophisticated dynamic programming with fast matrix searching [AKM+87], we have confirmed experimentally that our implementation is fast not only in theory but also in practice.

If we consider rectangular regions instead of x-monotone regions, the computation time for the optimal gain rule increases to O(n^{1.5}), where n is the number of pixels in the grid.

We present an O(n^{1.5})-time algorithm for computing optimized-gain rectilinear convex regions, which uses another type of dynamic programming. Next, we give efficient approximation algorithms for generating optimized-confidence and optimized-support x-monotone and rectilinear convex region rules (defined in Section 3.2), through the use of optimized-gain regions.

3.1.3 Visualization

It is convenient to visualize the region of a two-dimensional association rule ((A, B) ∈ R) ⇒ C. Let us regard the region R as a color image on a grid G as follows. Let u_{i,j} and v_{i,j} denote the total number of tuples and the number of tuples satisfying the objective condition C in the (i, j)-th pixel G(i, j), respectively. The pixel G(i, j) has the color vector (v_{i,j}, u_{i,j} − v_{i,j}, 0) in RGB space; that is, its red level is v_{i,j}, its green level is u_{i,j} − v_{i,j}, and its blue level is 0. Hence, the brightness level is u_{i,j}, which represents the number of tuples that fall within the pixel. The confidence of each pixel is represented by its color; thus, a redder pixel has a higher confidence.

We construct a visualization system for our two-dimensional association rules by reforming the color image introduced above, transforming it to make the rules easier for humans to grasp visually. We have tested our system, together with the functions given in Chapter 2 and Chapter 5, on a test database, and discovered several simple new rules, some of which seem to be of potential value in strategy planning. The database has about 30 numeric attributes and about 100 Boolean attributes. Suppose that we select two numeric attributes, Age and Balance, and also a Boolean attribute, CardLoanDelay. These attributes represent the ages of customers, the balances of their accounts, and whether they have delayed payment of their credit card charges, respectively. In order to characterize unreliable customers who have delayed payment, we consider the rule ((Age, Balance) ∈ R) ⇒ (CardLoanDelay = yes), and find an optimized region R. We divide the data into a grid of pixels. Our visualization system shows an optimized region in those pixels, as in Figure 3.2.

Figure 3.2: Visualization system

The data has an average confidence of 3.51%, which means that 3.51% of customers have delayed their credit card payments at one time or another. By pushing the Query button, we can see an optimized-gain region enclosed in thick lines. The confidence and the support (the percentage of tuples in the region) can be found at the bottom of the window. By moving the slider, we can control the trade-off between the support and the confidence: if we move the slider to the left, the support increases while the confidence decreases. In response to such a movement, the visualization system recomputes the region together with its support and confidence. Our system is so fast that one can continuously move the slider and see how the region changes, as if one were watching a motion picture. In Figure 3.2, we can find a region with a support of 14.1% and a confidence of 12.8%, which is much higher than the average confidence. The shape of the region tells us the characteristics of unreliable customers. As a result, we can say that unreliable customers, who have delayed payment of their card charges, can be characterized as relatively young people whose balance is low.

3.1.4 Related Work

The interrelation of paired numeric attributes is a major research topic in statistics; for example, covariance and line estimators are well-known tools for representing interrelations. However, these tools only show interrelations in an entire data set, and thus cannot extract a subset of the data in which a strong interrelation holds. In order to extract strong local interrelations, several heuristics using geometric clustering techniques have been introduced [NH94]. When we compute a two-dimensional association rule, we could regard the Boolean attribute in the objective condition as a special numeric attribute and apply three-dimensional clustering methods to extract some interrelations. However, this approach has serious defects. The three attributes we have discussed do not have equivalent roles: the Boolean attribute corresponds to the objective condition, whereas the others give presumptive conditions.

Therefore, a clustering method with respect to a three-dimensional proximity criterion does not find a good rule. Another defect is that clustering algorithms in three-dimensional space are often time-consuming. For example, if we want to compute the three-dimensional unit ball that contains the maximum number of points in a given set of n points, it takes O(n^3) time if we use a standard computational geometry algorithm.

Some other studies have also handled numeric attributes and tried to derive rules. Piatetsky-Shapiro [PS91] investigated how to sort the values of a numeric attribute, divide the sorted values into approximately equal-sized ranges, and use only those fixed ranges to derive rules whose confidences are almost 100%. His framework does not take account of any ranges other than the fixed ones. Our method is not only capable of outputting optimized ranges, but is also handier than Piatetsky-Shapiro's method, since we do not need to create candidate ranges beforehand. Recently, Srikant and Agrawal [SA96] improved Piatetsky-Shapiro's method by adding a way of combining consecutive ranges into a single range. The combined range could be the whole range of the numeric attribute, in which case the generated rule is trivial. To avoid this, Srikant and Agrawal presented an efficient way of computing a combined range whose size is at most a threshold given by the user. This method can discover association rules with more than two numeric attributes, while ours can handle at most two. On the other hand, their method uses only axis-aligned rectangles (hypercubes in high-dimensional cases).

Some techniques, related to but not directly applicable to finding optimized association rules, have been developed for handling numeric attributes in the context of deriving decision trees, which are used for classifying data into distinct groups. ID3 [Qui86a, Qui93], CART [BFOS84], CDP [AIS93b], and SLIQ [MAR96] perform binary partitioning of numeric attributes repeatedly until each range contains data of one specific group with high probability, while IC [AGI+92] uses k-ary decomposition. Since both approaches tend to yield large decision trees, methods such as pruning some branches and linking some ranges together have also been proposed to reduce the size of the trees.

The problem of extracting an optimal region resembles that of image segmentation in computer vision. Considerable work has been done on the segmentation of intensity images.

Five different approaches have mainly been used: threshold techniques, edge-based methods, region-based methods, hybrid techniques, and connectivity-preserving relaxation methods (e.g., [Zuc76, HS85, BJ88, SSW88, KWT88]). Some of these methods, especially region-based methods and connectivity-preserving relaxation methods, can be used for finding two-dimensional association rules. One of the disadvantages of these methods is that they heuristically search for regions that look good to humans, while our method can find the optimal regions with respect to natural objective criteria.

3.2 Preliminaries

Throughout this chapter, we focus on mining association rules of the form ((A, B) ∈ R) ⇒ C. Suppose that A, B, and C are fixed.

3.2.1 Pixel Grid and Regions

In practice, since we consider a huge database containing millions of tuples, we have to handle millions of points, which may occupy much more space than the main memory. To avoid this, we discretize the problem.

Definition 3.1 Let A_1, ..., A_{N_A} denote buckets of the domain of A, and let B_1, ..., B_{N_B} denote buckets of the domain of B (the definition of buckets is given in Definition 2.6). Consider a two-dimensional N_A × N_B pixel grid G, consisting of N_A · N_B unit squares called pixels. G(i, j) is the (i, j)-th pixel, where i and j are called the row number and column number, respectively. The i-th row G(i, ·) of G is the subset consisting of all pixels whose row numbers are i. The j-th column G(·, j) of G is the subset consisting of all pixels whose column numbers are j. Geometrically, a row is a horizontal stripe and a column is a vertical stripe.

We use the notation n = N_A · N_B. In our typical applications, N_A and N_B range from 20 to 500, and thus n is between 400 and 250,000. For simplicity, we assume from now on that N_A = N_B = N, although this assumption is not essential.

Definition 3.2 For a set of pixels, the union of the pixels in it forms a planar region, which we call a pixel region.

Definition 3.3 Let I be a range of row numbers and J be a range of column numbers. Then R = {G(i, j) | i ∈ I, j ∈ J} is a rectangular region.

Definition 3.4 A pixel region is x-monotone if its intersection with each column is undivided (and thus a vertical range).

Definition 3.5 A pixel region is rectilinear convex if both its intersection with each column and its intersection with each row are undivided.

From these definitions, a rectangular region is rectilinear convex, and a rectilinear convex region is x-monotone.

Definition 3.6 For each tuple t, t[A] and t[B] are the values of the numeric attributes A and B at t. We define a mapping f from the set of all tuples to the grid G by f(t) = G(i, j) such that t[A] is in the i-th bucket and t[B] is in the j-th bucket of the respective bucketings.

Definition 3.7 We call the number of tuples mapped to a pixel G(i, j) the support of G(i, j), and denote it u_{i,j}. A tuple is called a success tuple if the objective condition C holds for it. We call the number of success tuples mapped to a pixel G(i, j) the hit support of G(i, j), and denote it v_{i,j}. Given a pixel region R, define

support(R) = Σ_{G(i,j) ∈ R} u_{i,j},  and  hit(R) = Σ_{G(i,j) ∈ R} v_{i,j}.

We write conf(R) for hit(R) / support(R) and call it the confidence of R.

Definition 3.8 Given a threshold θ, we define

g(θ)_{i,j} = v_{i,j} − θ · u_{i,j},
gain(R) = hit(R) − θ · support(R) = Σ_{G(i,j) ∈ R} g(θ)_{i,j}.
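To make these definitions concrete, the following is a minimal C++ sketch (our own illustration, not the SONAR implementation; names such as bucketIndex and Grid are hypothetical). It maps tuples to pixels using precomputed bucket boundaries, accumulates the counts u_{i,j} and v_{i,j}, and evaluates support, hit, confidence, and gain for a pixel region given as a set of cells.

#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// Hypothetical illustration of Definitions 3.6-3.8.
struct Tuple { double a, b; bool c; };   // t[A], t[B], objective condition C

struct Grid {
    int N;                                // N x N pixels
    std::vector<long> u, v;               // u[i*N+j], v[i*N+j]
    explicit Grid(int n) : N(n), u(n * n, 0), v(n * n, 0) {}
};

// Bucket index of value x, given the right boundaries of equi-depth buckets.
int bucketIndex(const std::vector<double>& rightEnds, double x) {
    int i = static_cast<int>(
        std::lower_bound(rightEnds.begin(), rightEnds.end(), x) - rightEnds.begin());
    return std::min(i, static_cast<int>(rightEnds.size()) - 1);
}

// Accumulate u and v (Definition 3.7) over all tuples.
void buildGrid(Grid& g, const std::vector<Tuple>& data,
               const std::vector<double>& aBounds,
               const std::vector<double>& bBounds) {
    for (const Tuple& t : data) {
        int i = bucketIndex(aBounds, t.a);   // row: bucket of A
        int j = bucketIndex(bBounds, t.b);   // column: bucket of B
        g.u[i * g.N + j] += 1;
        if (t.c) g.v[i * g.N + j] += 1;      // success tuple
    }
}

// Evaluate support, hit, confidence, and gain of a pixel region (Definition 3.8).
struct RegionStats { long support = 0, hit = 0; double conf = 0, gain = 0; };

RegionStats evaluate(const Grid& g,
                     const std::vector<std::pair<int, int>>& region,
                     double theta) {
    RegionStats s;
    for (const std::pair<int, int>& p : region) {
        s.support += g.u[p.first * g.N + p.second];
        s.hit     += g.v[p.first * g.N + p.second];
    }
    if (s.support > 0) s.conf = static_cast<double>(s.hit) / s.support;
    s.gain = s.hit - theta * s.support;
    return s;
}

int main() {
    Grid g(4);
    std::vector<Tuple> data = {{25, 100, true}, {42, 900, false}, {31, 250, true}};
    std::vector<double> aBounds = {30, 40, 50, 1e9};     // bucket right ends for A
    std::vector<double> bBounds = {200, 400, 800, 1e9};  // bucket right ends for B
    buildGrid(g, data, aBounds, bBounds);
    RegionStats s = evaluate(g, {{0, 0}, {1, 1}}, 0.3);
    std::printf("support=%ld hit=%ld conf=%.2f gain=%.2f\n",
                s.support, s.hit, s.conf, s.gain);
    return 0;
}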

3.2.2 Two-Dimensional Association Rules

Definition 3.9 Let (A, B) ∈ R be a primitive condition, which holds for a tuple t if f(t) ∈ R. We want to find an association rule of the form ((A, B) ∈ R) ⇒ C, where R is a rectangular, x-monotone, or rectilinear convex region that optimizes one of the following criteria, as in the case of the ranges discussed in Chapter 2.

Definition 3.10 An optimized-gain region is the region that maximizes the gain. We call a region confident (resp. ample) if its confidence (resp. support) is not less than a given threshold value. We define an optimized-support region, which is the confident region that maximizes the support, and an optimized-confidence region, which is the ample region that maximizes the confidence.

Note that, if we want to find the arbitrary connected pixel region with the maximum gain, rather than a rectangular, x-monotone, or rectilinear convex region, the problem becomes NP-hard, in line with the argument used by Garey and Johnson [GJ77] for the grid Steiner tree problem.

3.3 Optimized Rectangular Regions

There are O(N^4) rectangular subregions of G. Thus, a naive algorithm computes the gain of each of these O(N^4) rectangles and outputs the one with the maximum gain. The time complexity of this algorithm is O(N^5) = O(n^{2.5}), which is too expensive. It can easily be reduced to O(n^{1.5}) by transforming the problem into the computation of optimized ranges. We choose a pair r < r' of rows in G, and consider only rectangles whose horizontal edges are on these rows. For each column index j, define

u'_j = Σ_{i=r}^{r'} u_{i,j}  and  v'_j = Σ_{i=r}^{r'} v_{i,j}.

Consider buckets B_1, B_2, ..., B_N such that the number of tuples in B_j is u'_j and the number of success tuples in B_j is v'_j.

From Theorems 2.1, 2.2, and 2.3, we can compute the optimized-gain, -support, and -confidence ranges in O(N) time. Since there are O(N^2) candidate pairs of rows, the optimized-gain, -support, and -confidence rectangles can be computed in O(N^3) = O(n^{1.5}) time.

Theorem 3.1 Each of the optimized-gain, -support, and -confidence rectangular regions can be computed in O(n^{1.5}) time.

For the case of the optimized-gain rectangles, the same time complexity can be found in Fischer et al. [FHLL93]. Further improvement of this time complexity is posed as a research problem in Programming Pearls [Ben84].
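As an illustration of the row-pair reduction above, the following C++ sketch (ours; only the gain criterion is shown, and the names are hypothetical) fixes a pair of rows, collapses each column into a single gain value, and finds the best range of consecutive columns with a linear scan. The support and confidence variants would plug in the corresponding one-dimensional algorithms of Chapter 2 instead of the simple scan; this sketch also returns only the maximum gain rather than the rectangle itself.

#include <algorithm>
#include <cstdio>
#include <vector>

// Optimized-GAIN rectangle by the row-pair reduction of Section 3.3:
// fix rows r..r2, collapse column j to colGain[j] = sum_i (v[i][j] - theta*u[i][j]),
// and find the maximum-sum range of consecutive columns. O(N^3) overall.
double maxGainRectangle(const std::vector<std::vector<long>>& u,
                        const std::vector<std::vector<long>>& v,
                        double theta) {
    const int N = static_cast<int>(u.size());
    double best = -1e300;
    for (int r = 0; r < N; ++r) {
        std::vector<double> colGain(N, 0.0);          // collapsed column gains
        for (int r2 = r; r2 < N; ++r2) {
            for (int j = 0; j < N; ++j)
                colGain[j] += v[r2][j] - theta * u[r2][j];
            // One-dimensional maximum-sum range over colGain (linear scan).
            double run = 0.0;
            for (int j = 0; j < N; ++j) {
                run = std::max(colGain[j], run + colGain[j]);
                best = std::max(best, run);
            }
        }
    }
    return best;
}

int main() {
    std::vector<std::vector<long>> u = {{5, 5, 5}, {5, 5, 5}, {5, 5, 5}};
    std::vector<std::vector<long>> v = {{1, 4, 0}, {2, 5, 1}, {0, 1, 0}};
    std::printf("max gain = %.2f\n", maxGainRectangle(u, v, 0.3));
    return 0;
}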

This time complexity is a little higher than that for the optimized x-monotone regions, and thus a data-mining system using rectangles is considerably slower than one using x-monotone regions. Furthermore, in our experience, the use of non-rectangular regions often yields useful rules. For example, let us consider Age and Salary as the numeric attributes, and GoldCard as the objective condition. Here, we would expect to find a rule that, among people on the same salary, younger ones are more likely to pay an annual fee for premium credit cards. This expectation is confirmed if we find a two-dimensional association rule whose region resembles a triangle, which is an x-monotone region, but (of course) not a rectangle. Consequently, we believe that x-monotone regions are better suited than rectangles as the class of regions for two-dimensional association rules.

3.4 X-monotone and Rectilinear Convex Regions

3.4.1 Computing Optimized-Gain x-monotone Regions

Since the intersection of an x-monotone region with a column is an interval, it would appear that if we compute the maximum-gain range in each column and take the union of all those ranges, we can obtain the optimized-gain x-monotone region. Unfortunately, this region is often disconnected, although the connectivity of a region is very important for creating a good rule. The following algorithm is essentially the same as that given by Asano et al. [ACKT96] for solving an image segmentation problem. However, to the best of our knowledge, this is the first case in which an algorithm using fast matrix-searching routines [AKM+87] has been implemented in a database system.

Definition 3.11 For each m = 1, 2, ..., N, we pre-compute the indices bottom_m(s) and top_m(s) for all 1 ≤ s ≤ N, where bottom_m(s) and top_m(s) are defined so that

Σ_{i=bottom_m(s)}^{s} g(θ)_{i,m}   and   Σ_{i=s}^{top_m(s)} g(θ)_{i,m}

are maximized, respectively.

Lemma 3.1 The indices top_m(s) and bottom_m(s) for all s = 1, 2, ..., N can be computed in O(N) time.

Proof Define Sum_m[< j] = Σ_{i=1}^{j−1} g(θ)_{i,m} for 2 ≤ j ≤ N, and Sum_m[< 1] = 0. bottom_m(s) maximizes

Σ_{i=bottom_m(s)}^{s} g(θ)_{i,m} = Sum_m[< (s + 1)] − Sum_m[< bottom_m(s)],

which is equivalent to the property that bottom_m(s) minimizes Sum_m[< bottom_m(s)]. Define bottom_m(1) = 1. For each s = 2, ..., N, compute bottom_m(s) from bottom_m(s − 1) in the following manner: if Sum_m[< s] < Sum_m[< bottom_m(s − 1)], then Sum_m[< bottom_m(s)] is minimized when bottom_m(s) = s; otherwise, it is minimized when bottom_m(s) = bottom_m(s − 1). During the above step, we only need to scan g(θ)_{i,m} once for each i = 1, ..., N. The computation of top_m(s) is analogous.

Definition 3.12 For two indices s and s', we define cover_m(s, s') as follows:

cover_m(s, s') = Σ_{i=bottom_m(s)}^{top_m(s')} g(θ)_{i,m}  if s ≤ s',
cover_m(s, s') = Σ_{i=bottom_m(s')}^{top_m(s)} g(θ)_{i,m}  if s > s'.
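As an illustration of Lemma 3.1, the following C++ sketch (our own, with hypothetical names) computes bottom_m(s) and top_m(s) for one column from the column's gains g(θ)_{i,m}, using the running prefix and suffix sums described in the proof. Indices are 0-based here, whereas the text uses 1-based indices.

#include <cstdio>
#include <vector>

// For one column m, g[i] = g(theta)_{i,m}. Computes, for every s,
//   bottom[s] = argmax over b <= s of sum_{i=b}^{s} g[i]
//   top[s]    = argmax over t >= s of sum_{i=s}^{t} g[i]
// in O(N) time, following the prefix-sum argument of Lemma 3.1.
void computeBottomTop(const std::vector<double>& g,
                      std::vector<int>& bottom, std::vector<int>& top) {
    const int N = static_cast<int>(g.size());
    bottom.assign(N, 0);
    top.assign(N, N - 1);

    // bottom[s]: keep the index b minimizing the prefix sum Sum[< b].
    double prefix = 0.0;          // Sum[< s], i.e. g[0] + ... + g[s-1]
    double bestPrefix = 0.0;      // minimum Sum[< b] seen so far
    int bestIndex = 0;
    for (int s = 0; s < N; ++s) {
        if (prefix < bestPrefix) { bestPrefix = prefix; bestIndex = s; }
        bottom[s] = bestIndex;
        prefix += g[s];
    }

    // top[s]: symmetric scan from the right, minimizing the suffix sum
    // of the part that is cut off after t.
    double suffix = 0.0;          // g[s+1] + ... + g[N-1]
    double bestSuffix = 0.0;
    bestIndex = N - 1;
    for (int s = N - 1; s >= 0; --s) {
        if (suffix < bestSuffix) { bestSuffix = suffix; bestIndex = s; }
        top[s] = bestIndex;
        suffix += g[s];
    }
}

int main() {
    std::vector<double> g = {-1, 3, -2, 4, -5};
    std::vector<int> bottom, top;
    computeBottomTop(g, bottom, top);
    for (int s = 0; s < 5; ++s)
        std::printf("s=%d bottom=%d top=%d\n", s, bottom[s], top[s]);
    return 0;
}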

Figure 3.3: F(i, m + 1) and the associated region

Definition 3.13 Let G(·, ≤ m) be the part of the grid on the left of the m-th column, including the column G(·, m). We define F(i, m) to be the maximum gain of x-monotone regions that contain the pixel G(i, m) and are contained in the region G(·, ≤ m). Then, we have the following formula:

F(i, m + 1) = max_{1 ≤ j ≤ N} { max(F(j, m), 0) + cover_{m+1}(i, j) }.   (3.1)

When F(j, m) is negative, we do not connect the m-th and (m + 1)-th columns. Figure 3.3 illustrates this formula and the associated region. By using Formula (3.1), we can compute max_m {max_i F(i, m)} and the associated region, which must be the optimized-gain region.

Lemma 3.2 Suppose that F(j, m) is given for j = 1, 2, ..., N. We can compute F(i, m + 1) for all i = 1, 2, ..., N in O(N) time.

Proof Define D(i, j) = F(j, m) + cover_{m+1}(i, j). We can see that the upper and lower triangular parts (D^+ and D^−, respectively) of the matrix D are totally monotone matrices. A matrix M is called totally monotone if M(i, j) + M(i + 1, j + 1) ≥ M(i, j + 1) + M(i + 1, j) for every 1 ≤ i < j + 1 ≤ N. It is well known [AKM+87] that all the locations of the row maxima of such a matrix can be computed in O(N) time (of course, we cannot afford to construct the matrix explicitly if we are to obtain this time complexity). Thus, in O(N) time, we can compute all the row maxima of D^+ and D^−, and consequently of D. By definition, F(i, m + 1) is the i-th row maximum of D.

Theorem 3.2 The optimized-gain x-monotone region for a threshold θ can be computed in O(n) time.

Proof We solve Formula (3.1) for m = 1, 2, ..., N. This requires O(N^2) = O(n) time.
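As a concrete illustration of Formula (3.1), the following C++ sketch (ours; names are hypothetical) evaluates the recurrence naively, using the prefix sums of each column to evaluate cover_m(i, j) in O(1) time. It runs in O(N^2) per column, i.e. O(N^3) overall, and returns only the maximum gain; the algorithm analyzed above replaces the inner maximization by fast matrix searching to reach O(N) per column and also recovers the region itself, which this sketch omits.

#include <algorithm>
#include <cstdio>
#include <vector>

// g[m][i] = g(theta)_{i,m}: gain of the pixel in column m, row i.
// Returns the maximum gain over all x-monotone regions by evaluating
// F(i, m+1) = max_j { max(F(j, m), 0) + cover_{m+1}(i, j) } column by column.
double maxGainXMonotone(const std::vector<std::vector<double>>& g) {
    const int N = static_cast<int>(g.size());        // number of columns
    const int R = static_cast<int>(g[0].size());     // number of rows
    std::vector<double> F(R, 0.0), Fnew(R, 0.0);
    double best = -1e300;

    for (int m = 0; m < N; ++m) {
        // Prefix sums and the bottom/top indices of Definition 3.11.
        std::vector<double> pre(R + 1, 0.0);
        for (int i = 0; i < R; ++i) pre[i + 1] = pre[i] + g[m][i];
        std::vector<int> bottom(R), top(R);
        int arg = 0;
        for (int i = 0; i < R; ++i) {                 // bottom[i]: min prefix
            if (pre[i] < pre[arg]) arg = i;
            bottom[i] = arg;
        }
        arg = R;
        for (int i = R - 1; i >= 0; --i) {            // top[i]: max prefix end
            if (pre[i + 1] > pre[arg]) arg = i + 1;
            top[i] = arg - 1;
        }
        // cover_m(i, j) of Definition 3.12, evaluated with prefix sums.
        auto cover = [&](int i, int j) {
            int lo = bottom[std::min(i, j)], hi = top[std::max(i, j)];
            return pre[hi + 1] - pre[lo];
        };
        for (int i = 0; i < R; ++i) {
            double v = -1e300;
            for (int j = 0; j < R; ++j)
                v = std::max(v, std::max((m > 0 ? F[j] : 0.0), 0.0) + cover(i, j));
            Fnew[i] = v;
            best = std::max(best, v);
        }
        F.swap(Fnew);
    }
    return best;
}

int main() {
    std::vector<std::vector<double>> g = {
        {-1,  2,  1, -3},
        { 1,  3, -2, -1},
        {-2,  1,  2, -1},
    };
    std::printf("max gain = %.1f\n", maxGainXMonotone(g));
    return 0;
}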

3.4.2 Computing Optimized-Gain Rectilinear Convex Regions

As we have seen in the previous section, we can compute optimized-gain x-monotone regions very efficiently. However, one defect pointed out by several people who have observed our output region rules is that, although the output x-monotone region usually gives an intuitive idea of the association between the attributes, it sometimes happens that the region is wildly notched, and it is then difficult to speculate on the meaning of the rule. Since speculation through visualization is very important for users who require decision-support knowledge, this is a serious problem. Moreover, if we use a very fine bucketing, the shape of the region tends to be very sensitive to sampling if we use a sample subset of the database to construct the region [YFM+97]. This dependency on the sample should be avoided, since we want the rule to be applicable not only to the data in the database but also to unknown data in general; however, it is often tedious to tune the size of the bucketings.

In order to resolve these defects, one idea is to design a tool that smooths the shape of grid regions and to apply it to the x-monotone region output by our algorithm; however, this may result in a loss of optimality, and we would like to avoid this kind of heuristic as far as possible. Our solution keeps the optimality formulations but replaces the family of x-monotone regions by the family of rectilinear convex regions. A region is rectilinear convex if it is both x-monotone and y-monotone. We show below that the optimized-gain region in this family can be computed in O(N^3) = O(n^{1.5}) time. This gives a more intuitive region, and the region is also stable with respect to data sampling even if we use a fine bucketing (as reported in [YFM+97]). As a trade-off, it is more expensive to compute optimized rectilinear convex regions than x-monotone regions; therefore, we should provide both functions so that the user can flexibly choose whichever is better suited to the application and data.

Let R be a rectilinear convex region. Let m_1 and m_2 respectively denote the indices of its first and last columns. Let s(i) and t(i) denote the indices of the bottom and top pixels of the i-th column. Since R is a rectilinear convex region, the sequence (t(m_1), t(m_1 + 1), ..., t(m_2 − 1), t(m_2)) of indices of top pixels from left to right increases monotonically up to some column and then decreases monotonically (possibly one of these two monotone parts is empty). Similarly, the sequence of indices of bottom pixels decreases monotonically up to some column and then increases monotonically. Therefore, we can cut a rectilinear convex region vertically into at most three monotone pieces, each of which is one of the four types defined in the following:

Definition 3.14 A region that gets wider from left to right is named a W-type region. Regions that slant upward or downward are named U-type and D-type regions, respectively. A region that gets narrower from left to right is named an N-type region.

The top picture in Figure 3.4 illustrates the cases in which both the sequence of top pixels (the top sequence, for short) and the sequence of bottom pixels (the bottom sequence, for short) are monotone, each either increasing or decreasing.

Figure 3.4: Partition of a rectilinear region into monotone parts (W-, U-, D-, and N-type pieces and their combinations)

The middle left part of Figure 3.4 shows the case in which the top sequence increases monotonically up to some column and then decreases monotonically, while the bottom sequence increases monotonically or decreases monotonically. We have two types of regions: one is a combination of a W-type subregion and a D-type subregion, which will be referred to as a WD-type region; the other is a combination of a U-type subregion and an N-type subregion, which will be called a UN-type region. The middle right part of Figure 3.4 shows the case in which the top sequence increases or decreases monotonically, while the bottom sequence decreases up to some column and then increases. Similarly, we have two types of regions: WU-type and DN-type. The bottom part of Figure 3.4 shows the case in which the top sequence increases up to some column and then decreases, while the bottom sequence decreases up to some column and then increases. Here we have three types of regions: WDN, WN, and WUN.

Theorem 3.3 The optimized-gain rectilinear convex region for a threshold θ can be computed in O(N^3) time.

Proof Our algorithm is based on the dynamic programming paradigm. Let g_{i,j} denote the gain of the pixel G(i, j), which is v_{i,j} − θ·u_{i,j}. Before running the dynamic program, we pre-compute Σ_{j ∈ [s,t]} g_{m,j}, which will be denoted by g_{m,[s,t]}, for m = 1, ..., N and 1 ≤ s ≤ t ≤ N. This computation takes O(N^3) time.

Let R_W(m, [s, t]) denote the rectilinear convex region that maximizes the gain among all W-type rectilinear convex regions whose last column is the m-th one and whose intersection with the m-th column ranges from the s-th pixel to the t-th pixel. Let f_W(m, [s, t]) be the gain of R_W(m, [s, t]). For m = 1, f_W(1, [s, t]) = g_{1,[s,t]}. For m > 1, if s = t, then f_W(m, [s, s]) = max{ g_{m,s}, f_W(m − 1, [s, s]) + g_{m,s} }. Consider the case when s < t. Since the region is of type W, the (m − 1)-th column of R_W(m, [s, t]) must be a subinterval of [s, t]. If it is a subinterval of [s, t − 1] (resp. [s + 1, t]), we can observe that the region must contain R_W(m, [s, t − 1]) (resp. R_W(m, [s + 1, t])). If R_W(m, [s, t]) contains neither R_W(m, [s + 1, t]) nor R_W(m, [s, t − 1]), its (m − 1)-th column must be the interval [s, t].

Hence, the following recurrence holds:

f_W(m, [s, t]) = max {
    f_W(m − 1, [s, t]) + g_{m,[s,t]},
    f_W(m, [s + 1, t]) + g_{m,s},
    f_W(m, [s, t − 1]) + g_{m,t}
}.

On the basis of this recursion formula, we can compute f_W(m, [s, t]) for all m and [s, t] in O(N^3) time.

Next, let R_U(m, [s, t]) denote the rectilinear convex region that maximizes the gain among all U-type and WU-type rectilinear convex regions whose last column is the m-th one and whose intersection with the m-th column ranges from the s-th pixel to the t-th pixel. Let f_U(m, [s, t]) be the gain of R_U(m, [s, t]). For m = 1, f_U(1, [s, t]) = g_{1,[s,t]}. For m > 1, we pre-compute max_{i ≤ s} f_W(m − 1, [i, t]) and max_{i ≤ s} f_U(m − 1, [i, t]) for all s ≤ t, which can be done in O(N^2) time. Since the region is of type U, the (m − 1)-th column of R_U(m, [s, t]) should be an interval [i, j] such that i ≤ s and j ≤ t. If j < t, we can observe that the region must contain R_U(m, [s, t − 1]). Hence, we have the following recurrence, which leads us to an O(N^3)-time dynamic programming algorithm:

f_U(m, [s, t]) = max {
    max_{i ≤ s} f_W(m − 1, [i, t]) + g_{m,[s,t]},
    max_{i ≤ s} f_U(m − 1, [i, t]) + g_{m,[s,t]},
    f_U(m, [s, t − 1]) + g_{m,t}   (or g_{m,t} when s = t)
}.

Next, let R_D(m, [s, t]) denote the rectilinear convex region that maximizes the gain among all D-type and WD-type rectilinear convex regions whose last column is the m-th one and whose intersection with the m-th column ranges from the s-th pixel to the t-th pixel. Let f_D(m, [s, t]) be the gain of R_D(m, [s, t]). For m = 1, f_D(1, [s, t]) = g_{1,[s,t]}. For m > 1, we pre-compute max_{i ≥ t} f_W(m − 1, [s, i]) and max_{i ≥ t} f_D(m − 1, [s, i]) in O(N^2) time. Since the region is of type D, the (m − 1)-th column of R_D(m, [s, t]) should be an interval [i, j] such that i ≥ s and j ≥ t. If i > s, we can observe that the region must contain R_D(m, [s + 1, t]).

Hence, we have the following recurrence, which leads us to an O(N^3)-time dynamic programming algorithm:

f_D(m, [s, t]) = max {
    max_{i ≥ t} f_W(m − 1, [s, i]) + g_{m,[s,t]},
    max_{i ≥ t} f_D(m − 1, [s, i]) + g_{m,[s,t]},
    f_D(m, [s + 1, t]) + g_{m,s}   (or g_{m,s} when s = t)
}.

Finally, let R_N(m, [s, t]) denote the rectilinear convex region that maximizes the gain among all rectilinear convex regions whose types are N, UN, DN, WDN, WN, or WUN, whose last column is the m-th one, and whose intersection with the m-th column ranges from the s-th pixel to the t-th pixel. Let f_N(m, [s, t]) be the gain of R_N(m, [s, t]). For m = 1, f_N(1, [s, t]) = g_{1,[s,t]}. For m > 1, we have the following recurrence (a mirror formula of that for the W-type):

f_N(m, [s, t]) = max {
    f_W(m − 1, [s, t]) + g_{m,[s,t]},
    f_U(m − 1, [s, t]) + g_{m,[s,t]},
    f_D(m − 1, [s, t]) + g_{m,[s,t]},
    f_N(m − 1, [s, t]) + g_{m,[s,t]},
    f_N(m, [s − 1, t]) − g_{m,s−1},
    f_N(m, [s, t + 1]) − g_{m,t+1}
}.

In this case, we need to compute f_N(m, [s, t]) by using f_N(m, [s − 1, t]) and f_N(m, [s, t + 1]). Thus we first compute f_N(m, [1, N]) = max{ f_N(m − 1, [1, N]) + g_{m,[1,N]}, g_{m,[1,N]} }, and run the dynamic program to compute the values of f_N(m, I) from larger intervals I to smaller ones.

Consequently, simple dynamic programming gives us an O(N^3)-time solution for computing f_W(m, [s, t]), f_U(m, [s, t]), f_D(m, [s, t]), and f_N(m, [s, t]) for all m and s ≤ t. We select the one with the maximum gain. The space complexity is naively O(N^3), but it can easily be reduced to O(N^2).

3.4.3 Computing Optimized-Confidence and -Support Regions Exactly

Unfortunately, if we consider x-monotone regions or rectilinear convex regions, the optimized-support region and the optimized-confidence region are difficult to compute. We only know an O(n^{1.5} M)-time algorithm for computing the optimized-confidence and the optimized-support x-monotone regions, where M is the total number of tuples (in the case of rectilinear convex regions, the time complexity becomes O(n^2 M)), which is impractical for huge databases. Moreover, for each of them, it can be shown that no algorithm running in polynomial time with respect to n and log M exists unless P = NP.

Let M denote the total number of tuples in the given database. Suppose that we have n (= N × N) pixels. For each pixel G(i, j), let u_{i,j} denote the number of tuples mapped to G(i, j), and let v_{i,j} denote the number of success tuples mapped to G(i, j).

Theorem 3.4 The optimized-confidence or the optimized-support x-monotone region can be computed in O(n^{1.5} M) time.

Proof Consider the set of x-monotone regions each of which satisfies the conditions that its support is k, all of its pixels are in or to the left of the m-th column, and its intersection with the m-th column is the subsequence of pixels ranging from G(s, m) to G(t, m). If the set is non-empty, let f(k, m, [s, t]) denote the maximum hit over all regions in the set. Otherwise, define f(k, m, [s, t]) = −∞. Suppose that we have computed f(k, m, [s, t]) for all k = 1, ..., M and m, s, t = 1, ..., N. By scanning f(k, m, [s, t]) once, we can obtain the optimized-confidence (or -support) x-monotone region in O(n^{1.5} M) time.

First we consider the basic step, when m = 1. For all pairs (s, t) with 1 ≤ s ≤ t ≤ N and for each k = 1, ..., M, f(k, 1, [s, t]) can be obtained as follows:

f(k, 1, [s, t]) = Σ_{i=s}^{t} v_{1,i}   if k = Σ_{i=s}^{t} u_{1,i},
f(k, 1, [s, t]) = −∞                    otherwise.

Next we present the inductive step, when m > 1. To compute f(k, m, [s, t]), we will use an auxiliary function h(k, m, [s, t]), which is the maximum hit of x-monotone regions each of which meets the conditions that its support is k, all of its pixels are in or to the left of the m-th column, and its intersection with the m-th column is a sequence that includes the subsequence of pixels ranging from G(s, m) to G(t, m).

We will later show how to compute h(k, m, [s, t]) from f(k, m, [s, t]). In the following algorithm, we assume that h(k, m − 1, [s, t]) is available for computing f(k, m, [s, t]).

for each s = 1, ..., N and each k = 1, ..., M
    f(k, m, [s, s]) := v_{m,s} + h(k − u_{m,s}, m − 1, [s, s]);
for each len = 2, ..., N
    for each s = 1, ..., N − len + 1
        t := s + len − 1;
        for each k = 1, ..., M
            f(k, m, [s, t]) := max{ f(k − u_{m,t}, m, [s, t − 1]) + v_{m,t},
                                    h(k − Σ_{i=s}^{t} u_{m,i}, m − 1, [t, t]) + Σ_{i=s}^{t} v_{m,i} };

The above procedure requires O(nM) time. In this procedure, we use only h(k, m − 1, [t, t]), instead of h(k, m − 1, [s, t]) for all ranges [s, t]. However, we need h(k, m, [s, t]) for all ranges [s, t] in order to compute h(k, m, [t, t]) in O(nM) time, and we use the following algorithm:

for each k = 1, ..., M
    h(k, m, [1, N]) := f(k, m, [1, N]);
for each len = N − 1, ..., 1
    for each s = 1, ..., N − len + 1
        t := s + len − 1;
        for each k = 1, ..., M
            h(k, m, [s, t]) := max{ h(k, m, [s − 1, t]), h(k, m, [s, t + 1]), f(k, m, [s, t]) };

where h(k, m, [s − 1, t]) = −∞ when s − 1 < 1, and h(k, m, [s, t + 1]) = −∞ when t + 1 > N.

We perform the above O(nM)-time procedures for m = 2, ..., N, and therefore we have an algorithm that computes the optimized-confidence (or -support) x-monotone region in O(n^{1.5} M) time. In the case of rectilinear convex regions, we can construct similar formulas for f_W(k, m, [s, t]), f_U(k, m, [s, t]), f_D(k, m, [s, t]), and f_N(k, m, [s, t]) that lead us to a dynamic program that computes the optimized-confidence (or -support) rectilinear convex region in O(n^2 M) time.

Theorem 3.5 There exists no algorithm that can compute the optimized-confidence (or -support) x-monotone region in polynomial time with respect to n and log M unless P = NP.

Proof Consider pixels that have the following properties:

u_{1,j} = v_{1,j} > 0 for each j = 1, ..., N;
u_{2,j} > 0 and v_{2,j} = 0 for each j = 1, ..., N;
u_{i,j} = v_{i,j} = 0 if i ≥ 3.

Suppose that K is the minimum support threshold such that Σ_{j=1}^{N} u_{1,j} < K ≤ M. Observe that the optimized-confidence x-monotone region for the minimum support K must contain all of the pixels G(1, j_1) and some of the pixels G(2, j_2). To compute the optimized region, we need to find a subset S of {1, ..., N} that minimizes Σ_{j_2 ∈ S} u_{2,j_2} under the condition that Σ_{j_2 ∈ S} u_{2,j_2} ≥ K − Σ_{j_1=1}^{N} u_{1,j_1}. If the optimized-confidence x-monotone region could be computed in time polynomial in n and log M, then in the same time complexity we could also answer whether or not there exists an S such that Σ_{j_2 ∈ S} u_{2,j_2} = K − Σ_{j_1=1}^{N} u_{1,j_1}, which is equivalent to the NP-complete subset sum problem [Kar72]. Consequently, unless P = NP, no algorithm exists for computing the optimized-confidence region in polynomial time with respect to n and log M.

The same argument carries over to the case of the optimized-support x-monotone region. Suppose that θ is the minimum confidence threshold such that

Σ_{j_1=1}^{N} u_{1,j_1} / (Σ_{j_1=1}^{N} u_{1,j_1} + Σ_{j_2=1}^{N} u_{2,j_2}) < θ ≤ 1.

Then the optimized-support x-monotone region for θ must be the set of all the pixels G(1, j_1) together with the pixels G(2, j_2) for j_2 ∈ S, where S (⊆ {1, ..., N}) maximizes Σ_{j_2 ∈ S} u_{2,j_2} under the condition that the confidence of the region is at least θ; that is,

Σ_{j_1=1}^{N} u_{1,j_1} / (Σ_{j_1=1}^{N} u_{1,j_1} + Σ_{j_2 ∈ S} u_{2,j_2}) ≥ θ,

which is equivalent to

(1/θ − 1) Σ_{j_1=1}^{N} u_{1,j_1} ≥ Σ_{j_2 ∈ S} u_{2,j_2}.

If the optimized-support x-monotone region could be computed in time polynomial in n and log M, then in the same time complexity we could answer whether or not there exists an S such that

(1/θ − 1) Σ_{j_1=1}^{N} u_{1,j_1} = Σ_{j_2 ∈ S} u_{2,j_2},

which is equivalent to the NP-complete subset sum problem. A similar argument carries over to the case of rectilinear convex regions, and hence there exists no algorithm that computes the optimized-confidence (or -support) rectilinear convex region in polynomial time with respect to n and log M unless P = NP.

3.4.4 Approximating Optimized-Confidence and -Support Regions

In this section, we substitute new optimization criteria for the optimized-support and optimized-confidence criteria. We will compute x-monotone or rectilinear convex regions that closely approximate the optimized-support or optimized-confidence ones.

For each region R, we define a stamp point (support(R), hit(R)). We make the following convexity assumption:

Convexity Assumption. Given three regions P_1, P_2, and P_3, let (x_i, y_i) be the stamp point of P_i for i = 1, 2, 3. If x_1 ≤ x_2 ≤ x_3 and y_2 ≤ y_1 + (y_3 − y_1)(x_2 − x_1)/(x_3 − x_1), we can substitute P_1 or P_3 for P_2 to create a useful association rule. See Figure 3.5.

In other words, if the stamp point (x_2, y_2) lies below or on the line through the stamp points (x_1, y_1) and (x_3, y_3), we do not use the region P_2 to create a rule. The convexity assumption cannot be theoretically confirmed, since the usefulness of an association rule is not a mathematical concept.

In practice, however, we have a huge number of stamp points that are fairly densely scattered on the upper convex hull of all stamp points in the Euclidean plane (see Section 3.6 for an example), as we usually handle a large number of pixels and a large volume of data, and hence for any P_2 there exist points P_1 and P_3 fairly close to P_2. Thus we believe that it is reasonable to use the convexity assumption for computing approximate solutions of optimized rules in a practical data-mining system.

Figure 3.5: Convexity assumption

Definition 3.15 We call a region that cannot be replaced by other regions according to the convexity assumption a focused region, the name used by Asano et al. [ACKT96] in the field of computer vision.

Let us characterize focused regions.

Lemma 3.3 A region is focused if and only if it is an optimized-gain region with respect to some confidence threshold.

Proof We consider the set S of all stamp points associated with regions. Because of the convexity assumption, a stamp point associated with a focused region must be a point on the upper convex hull of S. Hence, there exists a tangent line to the convex hull of S at this point.

Suppose that the tangent line has a slope τ. Then, this point maximizes y − τx over the set of stamp points. Accordingly, the corresponding region R maximizes hit(R) − τ·support(R), and hence R is the optimized-gain region with respect to the confidence threshold τ.

Definition 3.16 For a given confidence threshold, the focused optimized-support (x-monotone or rectilinear convex) region is defined as the support-maximizing confident (x-monotone or rectilinear convex) region that is focused.

Example 3.1 P_2 in Figure 3.5 is the optimized-support region for a confidence threshold θ. Since P_2 is hidden inside the convex hull and is not focused, P_1, the focused optimized-support region, is substituted for it.

Definition 3.17 For a given support threshold, the focused optimized-confidence (x-monotone or rectilinear convex) region is defined as the confidence-maximizing ample (x-monotone or rectilinear convex) region that is focused.

Example 3.2 P_2 in Figure 3.5 is the optimized-confidence region for a support threshold min_s, and P_2 is replaced by P_3, the focused optimized-confidence region.

Lemma 3.4 The focused optimized-support x-monotone region and the focused optimized-confidence x-monotone region can be computed in O(n log M) time, and the focused optimized-support rectilinear convex region and the focused optimized-confidence rectilinear convex region can be computed in O(n^{1.5} log M) time, where M is the support of the whole grid G.

Proof For a tangent line with slope τ, we can compute the optimized-gain x-monotone or rectilinear convex region R for the confidence threshold τ in O(n) time in the x-monotone case and O(n^{1.5}) time in the rectilinear convex case. Note that when τ increases, the confidence of R increases, and the support of R decreases monotonically. Thus, in order to search for the focused optimized-confidence (or -support) x-monotone (or rectilinear convex) region, we perform a binary search; that is:

1. Compute the region R_0 for τ = 0 and the region R_1 for τ = 1.

2. Compute R_2 for the slope of the line R_0R_1.

3. Repeat this process until we find R and R' such that R or R' is the region for the slope of the line RR'.

Observe that no focused regions exist between R and R'. The above binary search appears to search over all real numbers. However, since a stamp point has integer coordinate values, each slope τ is a rational number whose denominator and numerator are positive integers in [1, M], and the difference between two such distinct rational numbers is at least 1/M^2. Thus, we can stop the search when the width of the search range is reduced to 1/M^2. Hence, the binary search terminates in O(log M) search steps. Since a focused x-monotone (resp. rectilinear convex) region for a given threshold can be computed in O(n) (resp. O(n^{1.5})) time, the overall time complexity is O(n log M) (resp. O(n^{1.5} log M)). A small sketch of this search procedure is given at the end of this subsection.

Let P_2 be the optimized-support x-monotone (or rectilinear convex) region for the confidence threshold θ. Suppose that P_2 is not focused. Then, let P_1 be the focused optimized-support region for θ, and let P_3 be the focused optimized-confidence region for the support of P_2. Figure 3.5 illustrates this situation. From the definitions of the focused optimized-confidence and optimized-support regions, P_1 and P_3 are the optimized-gain regions associated with the slope of the line P_1P_3. Thus, the binary search for P_2, which is given in Lemma 3.4, computes P_1 and P_3 in its final step, and therefore we have the following:

Lemma 3.5 From the convexity assumption, P_1 and P_3 can replace P_2.

The case when P_2 is the optimized-confidence region can be handled similarly. Now we are interested in how close P_1 and P_3 are to P_2. A typical case is that in which P_1 ⊂ P_2 ⊂ P_3, as shown in Figure 3.6 (for the one-dimensional case). If the confidence threshold is 0.5 (i.e., 50%), P_2 is the optimized-support range, which has support 200 and confidence 0.5. However, P_2 is not a focused range (i.e., a maximum-gain range),

and instead the pair P_1 (support 150 and confidence 0.63) and P_3 (support 350 and confidence 0.485) are found as substitutes for P_2. Both P_1 and P_3 are maximum-gain ranges associated with the threshold (170 − 95)/(350 − 150), at which they attain the same gain. It can be observed that it is reasonable to choose P_1 or P_3 instead of P_2 to make an association rule.

Figure 3.6: Optimized-support range and its approximation by focused ranges

Unfortunately, it can happen even in the one-dimensional case that P_2 does not resemble P_1 or P_3. If the confidence threshold is 0.5 in the data of Figure 3.7, there is no intersection between P_2 and either P_1 or P_3. To overcome such an abnormal case, we could make P_1 a non-focused region by using a heuristic that decreases the values of v for the pixels in P_1, and then find a new focused region that gives a better approximation of P_2. This approach resembles the cutting plane method used in operations research [NKT89] to find a solution in the interior of a feasible region. However, we have not yet implemented this heuristic, since we have yet to encounter an optimized region of this kind that generates an important association rule in practical data.

Figure 3.7: A bad example
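To illustrate the search of Lemma 3.4, here is a minimal C++ sketch (ours; the names are hypothetical). It assumes an externally supplied optimizedGainRegion(tau) routine, standing in for the x-monotone or rectilinear convex optimized-gain computation, which returns the support and hit of the optimized-gain region for a confidence threshold tau. For simplicity the sketch uses a plain bisection over the slope interval rather than the secant-style iteration on the line R_0R_1 described in the proof, but the stopping criterion of width 1/M^2 is the same; it searches for a focused region whose support is at least a given threshold.

#include <cstdio>
#include <functional>

// Stamp point of an optimized-gain region: (support, hit).
struct Stamp { long support; long hit; };

// Slope search for a focused optimized-confidence region, given a minimum
// support threshold and the total number of tuples M.
Stamp focusedOptimizedConfidence(
        const std::function<Stamp(double)>& optimizedGainRegion,
        long minSupport, long M) {
    double lo = 0.0, hi = 1.0;
    Stamp best = optimizedGainRegion(0.0);   // tau = 0: maximum-support side
    const double eps = 1.0 / (static_cast<double>(M) * M);
    while (hi - lo > eps) {
        double tau = (lo + hi) / 2.0;
        Stamp r = optimizedGainRegion(tau);
        if (r.support >= minSupport) {
            best = r;      // still ample: confidence can only grow with tau
            lo = tau;
        } else {
            hi = tau;      // region became too small: lower the threshold
        }
    }
    return best;
}

int main() {
    // Toy stand-in: pretend the optimized-gain region for threshold tau has
    // support 1000*(1-tau)+1 and a confidence that grows with tau; real code
    // would run the dynamic program of Section 3.4.1 (or 3.4.2) instead.
    auto toy = [](double tau) {
        long support = static_cast<long>(1000.0 * (1.0 - tau)) + 1;
        long hit = static_cast<long>(support * (0.2 + 0.6 * tau));
        return Stamp{support, hit};
    };
    Stamp s = focusedOptimizedConfidence(toy, 300, 1000);
    std::printf("support=%ld hit=%ld conf=%.3f\n",
                s.support, s.hit, static_cast<double>(s.hit) / s.support);
    return 0;
}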

3.5 Visualization

Our scheme for two-dimensional data mining transforms a set of tuples into a color image on a pixel grid G. It thus provides an immediate method for visualizing our association rules. Unfortunately, if we naively use (v_{i,j}, u_{i,j} − v_{i,j}, 0) as the color vector of the (i, j)-th pixel, it does not always give what humans would regard as a good image: the result is often too dark, and if the confidence threshold is low, the red level is too low for differences in confidence to be distinguished. We must therefore apply transformations that make our rules more visible. Since these transformations should depend on the display system, we do not have a universal formula. However, we have experimentally implemented a transformation method specialized for our demonstration system, so that users can actually see our rules.

In the interactive mode of our demonstration system, the user chooses attributes (from about 30 numeric attributes and 100 Boolean attributes), indicates either gain, support, or confidence in order to select a feature of the rule, and inputs a threshold. The system outputs the corresponding optimized region. If the rule is a known one and the user wants to find a secondary rule, the system can remove the obtained optimized region and find a secondary optimized region by applying the same algorithm to the rest of the pixel grid. We also have an animated mode, in which the user can control the threshold (almost) continuously and see the changes in the rule. For this purpose, we must compute focused regions for many different thresholds on the fly. The efficiency of the algorithm for computing a focused region makes this approach practical.

3.6 Experimental Results

3.6.1 Performance

The algorithms in Sections 3.3 and 3.4 have been written in C++ and implemented as functions in our Database SONAR (System for Optimized Numeric Association Rules) prototype. Although we have also tested our system on real databases, we used synthetic data to evaluate its performance.

Our experiments were conducted on an IBM PC Power Series 850 with a 133-MHz PowerPC 604 and 96 MB of main memory, running under AIX 4.1. We obtained our test data as follows: we first generated random numbers uniformly distributed in [N^2, 2N^2] and assigned them to u_{i,j}; we then assigned 1, ..., N^2 to v_{i,j} from a cell in a corner to the central cell, like a spiral staircase.

Figure 3.8: Finding an optimized-confidence rectangle (elapsed time vs. number of pixels for min sup = 10%, 50%, and 90%)

Rectangular Regions

Figures 3.8 and 3.9 respectively show the execution times needed to find an optimized-confidence rectangle with a minimum support of 10%, 50%, and 90%, and an optimized-support rectangle with a minimum confidence of 50%, 70%, and 90%, for numbers of pixels ranging from 20 × 20 to 100 × 100. In both cases, the execution time increases almost in proportion to n^{1.5}.

x-monotone Regions

To determine the features of our test data, we counted the focused x-monotone regions in the data (Figure 3.10).

The number increases almost linearly in proportion to the square root of the number of pixels. We also did experiments on a few real data sets in a financial application, and observed that the number of focused regions increases sublinearly in the square root of the number of pixels. Figure 3.11 shows all the stamp points of x-monotone regions on the upper convex hull when the number of pixels is 40 × 40. Observe that those stamp points are fairly densely scattered on the upper hull.

Figure 3.9: Finding an optimized-support rectangle (elapsed time vs. number of pixels for min conf = 50%, 70%, and 90%)

Figure 3.12 shows the execution time needed to find an optimized-gain x-monotone region (i.e., a focused x-monotone region) with θ = 0.1, 0.5, and 0.9 for data sizes up to 1,000 × 1,000. The result of this experiment was what we had expected: the execution time needed to find a focused region increases linearly in proportion to the number of pixels.

Figures 3.13 and 3.14 respectively show the execution times needed to find a focused optimized-confidence region with a minimum support of 10%, 50%, and 90%, and a focused optimized-support region with a minimum confidence of 50%, 70%, and 90%, for numbers of pixels ranging from 20 × 20 to 100 × 100. In both cases, the execution time increases almost linearly in proportion to the number of pixels.

Figure 3.10: Number of focused x-monotone regions (number of focused images vs. number of pixels, 20 × 20 to 100 × 100)

Figure 3.11: Stamp points of focused regions (hit vs. support for the focused images of a 40 × 40 grid)

Figure 3.12: Finding an optimized-gain x-monotone region (elapsed time vs. number of pixels for θ = 0.1, 0.5, and 0.9)

Figure 3.13: Finding a focused optimized-confidence x-monotone region (elapsed time vs. number of pixels for min sup = 10%, 50%, and 90%)

Figure 3.14: Finding a focused optimized-support x-monotone region (elapsed time vs. number of pixels for min conf = 50%, 70%, and 90%)

Rectilinear Convex Regions

Figures 3.15 and 3.16 respectively show the execution times needed to find a focused optimized-confidence rectilinear convex region with a minimum support of 10%, 50%, and 90%, and a focused optimized-support rectilinear convex region with a minimum confidence of 50%, 70%, and 90%, for numbers of pixels ranging from 20 × 20 to 100 × 100. As we expected, the execution time increases in proportion to n^{1.5} in both cases.

3.6.2 Overfitting

We experimentally show that an optimized x-monotone region sometimes overfits the training dataset seriously, and therefore tends to fail to give good predictions for an unseen dataset, while optimized rectilinear convex regions do not suffer from this overfitting problem as much. For this experiment, we generate synthetic datasets that represent typical cases in practice. Let A and B be numeric attributes such that the domain of both A and B is the interval [−1, 1], and let C be an objective Boolean attribute.

Figure 3.15: Finding a focused optimized-confidence rectilinear convex region

Figure 3.16: Finding a focused optimized-support rectilinear convex region

1. Generate random points that are uniformly distributed in [−1, 1] × [−1, 1].

2. For each point (t[A], t[B]), determine the value of t[C] as follows, and add the tuple t to the dataset. Let p(x, y) be a function from [−1, 1] × [−1, 1] to [0, 1]. Set t[C] to 1 if a uniformly distributed random number in [0, 1] is less than p(t[A], t[B]), and set t[C] to 0 otherwise.

We use the following functions for p(x, y):

    p_linear(x, y) = (1/√π) exp(−(x − y)^2 / 2),

which is the normal distribution N(0, 1/√2) with respect to the distance between (x, y) and the diagonal axis y = x, and

    p_circular(x, y) = (1/√π) exp(−(x^2 + y^2)),

which is the normal distribution N(0, 1/√2) with respect to the distance between (x, y) and the origin (0, 0).

We generate two datasets, each of which consists of 10,000 records: one, named D_linear, by using p_linear(x, y), and the other, named D_circular, by using p_circular(x, y).

To compare x-monotone regions and rectilinear convex regions with respect to overfitting, we perform the following N-fold cross validation:

1. We randomly divide a given dataset, D_linear or D_circular, into N equal-sized subsets.

2. We take the union of N − 1 subsets and use it as the training dataset for generating optimized regions; that is, we compute the optimized-confidence x-monotone region and rectilinear convex region for a given minimum support threshold, say 50%, from the training data.

3. We then use the remaining subset as the test dataset. We compute the ratio of the number of success tuples among the test tuples in each optimized region to the total number of test tuples in the region, which we call the confidence of the region against the test data.

Figure 3.17: Results for D_linear

4. We repeat the above three steps N times, and then compute the average of all the confidences.

We performed 10-fold cross validation for D_linear and D_circular. Figures 3.17 and 3.18 show the results. First, we can observe that rectilinear convex regions give the highest confidence on the test data among the three classes of regions. Second, as the number of pixels grows, the gap between the confidence on the training data and that on the test data widens. This gap grows much faster for x-monotone regions than for the other classes, which implies that the optimized x-monotone region overfits the training data. On the other hand, the confidence of the rectilinear convex region and the rectangular region on the training data is almost stable, even as the number of pixels increases.
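For concreteness, the following is a minimal Python sketch of the synthetic data generation and the cross-validation loop described in this subsection. It is illustrative only: the parameter optimize_region stands in for the region-optimization algorithms of Chapter 3 and is not defined here, and the normalization of p(x, y) follows the formulas given above.

```python
import math, random

def p_linear(x, y):
    # N(0, 1/sqrt(2)) density of the distance to the line y = x
    return math.exp(-((x - y) ** 2) / 2.0) / math.sqrt(math.pi)

def p_circular(x, y):
    # N(0, 1/sqrt(2)) density of the distance to the origin
    return math.exp(-(x * x + y * y)) / math.sqrt(math.pi)

def make_dataset(p, size=10_000, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(size):
        a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
        c = 1 if rng.random() < p(a, b) else 0
        data.append((a, b, c))
    return data

def cross_validate(data, optimize_region, n_folds=10, seed=0):
    """optimize_region(train) -> predicate telling whether a tuple falls in R_opt."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    confidences = []
    for i in range(n_folds):
        test = folds[i]
        train = [t for j, f in enumerate(folds) if j != i for t in f]
        in_region = optimize_region(train)          # hypothetical optimizer
        hit = [t for t in test if in_region(t)]
        if hit:
            confidences.append(sum(t[2] for t in hit) / len(hit))
    return sum(confidences) / len(confidences)
```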

Figure 3.18: Results for D_circular

3.7 Conclusions

In this chapter, we have discussed the problem of finding numeric association rules of the form

    ((A, B) ∈ R) ⇒ C,

where A and B are numeric attributes, R is a planar region, and C is a target condition of interest. We extended the criteria of optimality of rules introduced in Chapter 2 to the two-dimensional case. Because the problem of finding the optimal region is intractable if we consider arbitrary connected regions, we introduced three basic and handy classes of regions: (1) rectangular regions; (2) x-monotone regions; and (3) rectilinear convex regions.

We presented algorithms for computing an optimized-gain, -confidence, or -support rectangular region in O(n^1.5) time, where n is the number of pixels, by transforming the problem into that of computing optimized ranges (Section 3.3). We presented an asymptotically optimal algorithm for computing an

optimized-gain x-monotone region in O(n) time, which uses sophisticated computational geometry techniques and dynamic programming (Section 3.4.1), and also presented an efficient algorithm for computing an optimized-gain rectilinear convex region in O(n^1.5) time (Section 3.4.2). We proved that no algorithm can compute an optimized-confidence or optimized-support x-monotone or rectilinear convex region in time polynomial in n and log M unless P = NP, where M is the total number of tuples (Section 3.4.3). To overcome this difficulty, we presented an efficient algorithm for approximating the optimized-confidence and -support regions through the use of optimized-gain regions (Section 3.4.4).

We experimentally showed that our algorithms are fast not only in theory but also in practice (Section 3.6.1). We also examined the characteristics of the three classes of regions with regard to the prediction of unseen data (Section 3.6.2). We showed that rectilinear convex regions yield better prediction accuracy than x-monotone regions. As a trade-off, it is more expensive to compute optimized rectilinear convex regions than x-monotone regions; therefore both functions should be provided so that the user can flexibly choose whichever is better suited to the application and data.

Chapter 4

Parallel Aggregation

4.1 Introduction

Aggregation is one of the most important operations in decision support systems (DSSs). Online analytical processing (OLAP) [CCS93] applications often require data in very large databases (so-called data warehouses) to be summarized across various combinations of attributes so that analysts can gain insight into the performance of an enterprise from many different business perspectives. Gray et al. [GBLP95] proposed a data cube operator that generalizes the SQL group-by construct to support multidimensional data analysis. Data cubes provide a typical view of the data in DSS applications. They can be used to visualize the data in a graphical way, and allow data-mining algorithms to find important patterns automatically [FMMT96c, FMMT96b, YFM+97]. Techniques for effective computation of data cubes have attracted considerable research interest [GHQ95, HRU96, JS96, GHRU96, SAG96, AAD+96].

Since "online" means interactive, the response time is crucial in OLAP environments. To reduce response times, most OLAP systems, including multi-dimensional databases and statistical databases, pre-compute frequently used aggregates. These materialized aggregate views are commonly referred to as summary tables, and they can be used to help answer other aggregate queries. Gupta et al. [GHQ95] discussed how materialized aggregate views can be used to answer aggregate queries.

The pre-computation cost should also be small, since it determines how frequently the aggregates can be brought up to date. It is difficult to pre-compute a complete data cube, because such a cube can be very large. Harinarayan et al. [HRU96] presented a greedy algorithm for deciding which subset of a data cube should be pre-computed; the algorithm selects a near-optimal collection of aggregate queries in the data cube. Recently, [AAD+96] discussed various optimization techniques for computing related multiple aggregates, which may be chosen by the user or by an algorithm such as that in [HRU96]. Query-optimizing methods for aggregate queries have been thoroughly investigated, and are summarized in [Gra93]. A few parallel algorithms for aggregate queries have been reported in the literature.

To process data cubes, we need to handle multiple aggregate queries. Applying conventional parallel algorithms to the aggregate queries in a data cube operator would appear to be straightforward. But, as we explain later, there may be so many queries that conventional methods will not work well.

4.1.1 Motivating Example

Example 4.1 Consider a relation of tens of millions of bank customers. It has 30 numeric attributes such as Age, CheckAccountBalance, and FixedDepositBalance, and 100 categorical attributes such as Sex, Occupation, CardLoanDelay, and CreditCardClass. An analyst wants to discover any unknown patterns and rules in the database. Both the number of records and the number of attributes in this example were taken from a real-life data set; the typical databases used to explain OLAP applications do not have as many attributes as this example.

As we will explain, because the number of dimensions is so large, the total size of the aggregates to be pre-computed will be very large, even though the algorithm proposed in [HRU96] can select the best subset of the data cube for reducing the response time of probable queries. Numeric attributes are normally used as measures to aggregate, but they can also be used as dimensions to make groups. Since the domain size of a numeric attribute such as CheckAccountBalance tends to be very large,

grouping over such attributes will hardly summarize the data at all (that is to say, the size of the result will be as large as that of the raw data). Analysts are not interested in how many customers have exactly the same amount of money, but would like to know the customer distribution along the numeric dimension. Thus the domain of such a numeric attribute should be divided into a certain number of buckets, say 100, and grouping over the bucket number will provide an answer to the question. Analysts often examine such data distributions by using spreadsheets and histograms, to find interesting patterns and rules.

As we have 130 dimensions, the size of the whole data cube is huge (2^130 vertices). For simplicity, let us assume that each attribute A_i (either numeric or categorical; i = 1, 2, ..., 130) has 100 distinct, uniformly distributed values. We adopt the cost model used in [HRU96], in which the aggregate query processing cost is proportional to the size of the source relation of the query. Figure 4.1 shows the size of the data cube. There is no point in materializing an aggregate with four or more dimensions, because the size of the result will not be any smaller than that of the raw data (100^4 is larger than the size of the raw data, 10^7), and hence the results of four-or-more-dimensional aggregates provide no help in answering other aggregate queries. Therefore, materializing a three-dimensional aggregate will yield the largest benefit. Similarly, the second-best to the forty-fourth-best choices are three-dimensional aggregates. For all one-dimensional aggregates to be computable not from the raw data but from a materialized view, it is necessary to pre-compute ⌈130/3⌉ = 44 three-dimensional aggregates. If we want to make all two-dimensional aggregates computable from a materialized view, we need to pre-compute at least (130 choose 2)/3 = 2,795 three-dimensional aggregate queries. Since we assumed that each attribute is distributed uniformly, every three-dimensional query produces 100^3 = 1,000,000 tuples. Even though an actual database has skewed data distributions and the tuple size of each query result is very small, the total size of the summary tables will be too large to fit into main memory.

Example 4.2 Let us consider another scenario, using the same bank customer database. Aggregate queries can also be used for data mining. Data

Figure 4.1: Example of a data-cube lattice

mining algorithms presented in [FMMT96c, FMMT96b, YFM+97] can discover interesting association rules with one or two numeric attributes, such as "a customer whose age and balance fall in some two-dimensional region tends to delay his/her card loan payment with high probability." The rule characterizes a target property ("unreliable" customers) by using two numeric attributes (age and balance). The algorithms can derive the rule from the distribution (count()), in terms of age and balance, of all customers and that of customers who have delayed their payments at some time. Figure 4.2 illustrates the customer distribution and the region of the rule, where the brightness and saturation of each pixel, respectively, represent the number of all customers and the ratio of unreliable customers. Such association rules can be used to construct accurate decision trees and regression trees [FMMT96a, MIM97]. Assuming that a function bucket() returns the bucket number, a pair of naive SQL queries computes these data distributions as follows:

    SELECT bucket(age), bucket(balance), count()
    FROM Customers
    GROUP BY bucket(age), bucket(balance);

Figure 4.2: Example rule. The most unreliable customers are in an automatically found region R: ((Age, Balance) ∈ R) ⇒ (CardLoanDelay = yes)

    SELECT bucket(age), bucket(balance), count()
    FROM Customers
    WHERE CardLoanDelay = YES
    GROUP BY bucket(age), bucket(balance);

Since these two queries have the same group-by clause, we can share the computation for making groups. If count() takes a condition as an argument and returns the number of tuples that satisfy the condition, the following single query will compute the necessary customer distributions:

    SELECT bucket(age), bucket(balance), count(), count(CardLoanDelay = YES)
    FROM Customers
    GROUP BY bucket(age), bucket(balance);

In general, we do not know beforehand which combinations of numeric attributes adequately characterize the target property. Therefore, we usually compute all the potential rules, sort them in order of interestingness (we often use confidence, the probability that a rule holds, as a measure), and then select good ones. This mining process requires computation of the distributions of all customers and of unreliable customers over every two-dimensional combination of numeric attributes. Since we have 30 numeric attributes, the number of queries for obtaining the necessary two-dimensional data distributions is (30 choose 2) = 435.

We have focused on the target property represented by the expression CardLoanDelay = yes. However, any conditional expression can be a target, and we may have any number of targets. The simplest form of a target condition is attribute = value. It is natural for data mining to use all pairs of categorical attributes and their values as target properties, because every attribute in the database represents some property of interest. We have 100 categorical attributes, each with between two and several dozen possible values. Therefore the potential number of target properties will range from 100 to several thousand.
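To illustrate the shared grouping of the combined query above outside SQL, the following minimal Python sketch (an illustration only, not the system's implementation) computes both distributions, the count of all customers and the count of customers with delayed payments, in a single pass with one hash table keyed on the bucket pair. The bucketing functions are assumed to be supplied by the caller.

```python
from collections import defaultdict

def two_dim_distribution(tuples, bucket_age, bucket_balance):
    """tuples: iterable of (age, balance, delayed) records, delayed being a Boolean.
    Returns {(age_bucket, balance_bucket): (count_all, count_delayed)}."""
    groups = defaultdict(lambda: [0, 0])
    for age, balance, delayed in tuples:
        key = (bucket_age(age), bucket_balance(balance))
        cell = groups[key]
        cell[0] += 1                      # count() over all customers
        cell[1] += 1 if delayed else 0    # count(CardLoanDelay = YES)
    return {k: tuple(v) for k, v in groups.items()}
```

With several hundred target properties, each cell would simply carry one extra counter per target, which is exactly why the result tuples become so wide in the scenario discussed below.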

Let us assume that we have 500 target properties. Since we can share the computation necessary to create common groupings, the number of queries is still (30 choose 2) = 435, the same as when we have only one target. But the tuple size of each query result is multiplied so that it contains 503 fields (2 for bucket numbers, 1 for counting all customers, and 500 for counting target properties). The typical number of buckets for each numeric attribute is between 20 and 400. Thus the number of groups in a query will be between 20^2 = 400 and 400^2 = 160,000. Remember that the original data has 130 attributes and 10^7 records. Consequently, the total result is much larger than the original database.

As we have seen in the examples, some decision support applications require computation of a large number of aggregate queries on a relation. The results of the queries can be used as materialized views and for data mining. Since serial algorithms would take a very long time, we investigate parallel algorithms on shared-nothing parallel architectures.

In this chapter, we focus on computing a given set of aggregate queries on a single relation in parallel. The source relation may be a base relation (raw data) or another previously computed aggregate table. We evaluate three parallel algorithms on the basis of both analytical and empirical studies, and present a way of choosing the best one. Unexpectedly, a broadcasting algorithm outperforms the other algorithms for problems of certain sizes. We assume that the part of the data cube to be computed and the relation from which the aggregates are to be computed have already been decided through the use of other query optimization techniques such as [HRU96, AAD+96]. We also assume that the aggregating functions are distributive [GBLP95]. Distributive functions, such as count(), min(), max(), and sum(), can be partially computed and later combined.

The rest of this chapter is organized as follows. In Section 4.2, we introduce three conventional parallel algorithms for processing single aggregate queries. In Section 4.3, we extend those algorithms so that they can handle multiple aggregate queries efficiently, and describe simple analytical cost models of the algorithms. In Section 4.4, we compare the algorithms by using these cost models. In Section 4.5, we describe implementations of the

algorithms and give their performance results. In Section 4.6, we present our conclusions and discuss future work.

4.2 Algorithms for Single Queries

The literature contains only a few studies of parallel aggregate processing. Two conventional parallel algorithms, Two-Phase and Repartitioning, are mainly used for computing single aggregate queries; these algorithms will be explained later. [BBDW83] analyzed two algorithms for aggregate processing on tightly coupled multiprocessors with a shared disk cache. One is similar to the Two-Phase algorithm, while the other broadcasts all tuples so that each node can use the entire relation. The latter approach has been considered impractical on shared-nothing multiprocessors [SN95]. Recently, however, high-bandwidth multiprocessor interconnects, such as the High-Performance Switch (HPS) [SSA+95] for the IBM SP2 [AMM+95], and low-cost gigabit LAN adaptors for PCs, have become available, and will soon be common. Thus we should evaluate the feasibility of the broadcasting approach with high-performance networks using such equipment.

In this section, we describe the Two-Phase, Repartitioning, and Broadcasting algorithms for processing single aggregate queries. We assume that the source relation is partitioned in round-robin fashion, and that aggregation on a node is performed by hashing. The underlying uniprocessor hash-based aggregation works as follows:

1. All tuples of the source relation are read, and a hash table is constructed by hashing on the group-by attributes of each tuple. Note that the memory requirement for the hash table is proportional to the number of groups seen.

2. When the group values do not all fit into the memory allocated for the hash table, the tuples that do not belong to groups already in memory are hash-partitioned into multiple bucket files on disk (as many as necessary to ensure no future memory overflow).

3. The overflow bucket files are processed one by one in the same way as in Step 1.
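The following is a minimal, single-process Python sketch of this hash-based aggregation with overflow partitioning. It is illustrative only: the overflow "bucket files" are kept as in-memory lists for brevity (a real system would spill them to disk), and the memory limit, grouping key, and aggregation callbacks are assumed parameters rather than the thesis's implementation.

```python
from collections import defaultdict

def hash_aggregate(tuples, group_key, init, update, max_groups, fanout=4):
    """Aggregate tuples by group_key, spilling tuples of new groups to
    overflow partitions whenever the hash table already holds max_groups entries."""
    result = {}
    pending = [list(tuples)]          # stand-in for the source scan / bucket files
    while pending:
        batch = pending.pop()
        table = {}
        overflow = defaultdict(list)  # stand-in for on-disk bucket files
        for t in batch:
            k = group_key(t)
            if k in table:
                update(table[k], t)
            elif len(table) < max_groups:
                table[k] = init(t)
            else:
                overflow[hash(k) % fanout].append(t)   # hash-partition the overflow
        result.update(table)
        pending.extend(b for b in overflow.values() if b)
    return result
```

Here update mutates the running aggregate in place; for example, init could return [1, t.balance] and update could increment the count and add the balance, which suffices for distributive functions such as count() and sum().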

4.2.1 Two-Phase Algorithm (2P)

The Two-Phase (2P) algorithm [Gra93] simply partitions the input data. In its first phase, each node (processor) in the system computes aggregates on its local partition of the relation. In the second phase, these partial results are collected at one of the nodes, which merges them to produce the final results. The second phase can be parallelized by hash-partitioning on the group-by attributes.

As explained in [BBDW83, SN95], 2P has two problems when the grouping selectivity is high. First, since a group value may be accumulated on all the nodes, the memory requirement of each node can be as large as the overall result, which may not fit into main memory. Second, the duplication of aggregation work between the first phase and the second phase becomes significant [SN95]. As we have seen in the examples, the total size of the aggregates may be very large in DSS applications, so the effectiveness of this algorithm is not clear.

4.2.2 Repartitioning Algorithm (Rep)

The Repartitioning (Rep) algorithm [SN95] partitions groups, and can thus work well for large numbers of groups. It redistributes the data on the group-by attributes, and then performs aggregation on each partition in parallel. This algorithm is efficient when the grouping selectivity is high, because it eliminates duplication of work by processing each value for aggregation just once. It also reduces the memory requirement, since each group value is stored in only one place. However, when the grouping selectivity is so low that the number of groups is smaller than the number of processors, this algorithm cannot use all the processors, which severely affects the performance. (Likewise, the second phase of 2P cannot use all the processors for small numbers of groups, but this hardly affects its performance.)

4.2.3 Broadcasting Algorithm (BC)

The Broadcasting (BC) algorithm [BBDW83] broadcasts all local disk pages to all nodes. Receiving the entire source relation, each node computes

aggregations for the groups assigned to it. Since each group value is stored in only one place, the memory requirement of this algorithm is the same as that of the Rep algorithm. The BC algorithm computes both groups and aggregations on the receiving nodes; in contrast, the Rep algorithm computes the groups of tuples on the nodes where the source data are read. When the system does not support broadcast efficiently, this algorithm is inferior to the Rep algorithm, since the amount of data passed through the network is larger than with the Rep algorithm. When the number of groups is smaller than the number of nodes, the BC algorithm cannot use all the processors, for the same reason as in Rep.

4.3 Algorithms for Multiple Queries

Let Q be the number of queries. Obviously, by applying the algorithms for single aggregate queries sequentially Q times, we can process all of the queries. However, such trivial algorithms can be improved by using the fact that all the queries have the same source relation. In this section, we describe three algorithms for multiple queries, m2p, mrep, and mbc, which are based on the 2P, Rep, and BC algorithms, respectively. The main idea is that by processing multiple queries simultaneously we can share common processing. We also present simple analytical cost models, which were developed to predict the relative performance of the algorithms for various execution environments and problem sizes. Similar models have been presented previously [SN95]. We compare the algorithms in Section 4.4.

For simplicity, the models assume no overlap between CPU, I/O, and message-passing operations, and that all processors work completely in parallel. These assumptions allow us to compute the performance by summing up the partial I/O and CPU costs on a single node. The parameters of the models are listed in Table 4.1. We assume that there are Q aggregate queries on a single relation, and that each query contains A aggregate functions. We also assume that these queries are performed directly on a base relation stored on disks. |R| is the number of tuples in the source relation, and S is the grouping selectivity, which is the

ratio of the number of groups to the number of tuples in the source relation. Therefore, the number of groups |G| is equal to |R| · S. p represents the projectivity, which is the ratio of the output tuple size to the input tuple size. To model the data-mining example, we make A and p high (100 and 50%, respectively). Most parameters were measured by using Database SONAR [FMMT96d], a prototype data-mining system running on a 16-node IBM SP2 [AMM+95] with a High-Performance Switch (HPS) [SSA+95]. The SP2 also has an ordinary 10 Mb/s Ethernet. Including the MPI [MPI94] protocol overhead, the point-to-point transfer rate of the HPS is about 40 MB/s, and that of the Ethernet is about 0.8 MB/s. Note that the HPS transfer rate is much faster than the ordinary disk I/O speed (about 5 MB/s).

4.3.1 Two-Phase Algorithm (m2p)

The m2p algorithm processes Q queries simultaneously. In its first phase, each node reads its local source relation and computes the aggregations of all Q queries. The second phase merges the partial results in parallel. Since only a single scan of the source data is necessary, the scan cost is minimal. However, the memory requirement is multiplied by Q, and 2P requires as much main memory on each node as the overall result size. Therefore memory overflow is liable to happen, necessitating extra I/Os for bucket files. If we divide the Q queries into T parts (1 ≤ T ≤ Q, each part consisting of Q/T queries) and process the queries of each part simultaneously in a separate scan, the cost of scans increases on the one hand, and the cost of extra I/Os is reduced on the other. Therefore, to achieve the best performance, we may have to control the number of scans (T) and trade off the cost of extra I/Os against the cost of scans. The cost model of this algorithm is as follows:

1. Scanning cost: (|R_i| / P) · IO · T
2. Selection cost: |R_i| · t_r · T
3. Local aggregation cost: |R_i| · (t_h + t_a · A) · Q
4. Cost of the extra writes and reads required for tuples not processed in the first pass: ((|R_i| · p · Q − M/S_l · T) / P) · 2 IO

    Symbol   Description                             Value
    N        Number of nodes                         Variable
    R        Size of relation in bytes               5 GB
    |R|      No. of tuples in R                      10 million
    |R_i|    No. of tuples on node i                 |R|/N
    |R_p|    No. of repartitioned tuples             max(|R_i|, 1/(S·Q))
    Q        No. of aggregate queries                Variable
    T        No. of scans of the source data         Variable
    A        No. of aggregate functions per query    100
    S        Grouping selectivity of a query         Variable
    S_l      Phase 1 selectivity (for 2P)            min(S·N, 1)
    S_g      Phase 2 selectivity (for 2P)            min(1/N, S)
    p        Projectivity                            0.5
    |G|      No. of result tuples                    |R| · S
    |G_i|    No. of result tuples on node i          |R_i| · S_l
    P        Page size                               4 KB
    M        Size of hash table                      50 MB
    mips     CPU speed                               120 MIPS
    IO       Time to read a page                     0.8 ms
    t_r      Time to read a tuple                    200 / mips
    t_w      Time to write a tuple                   200 / mips
    t_h      Time to compute a hash                  200 / mips
    t_a      Time to compute an aggregate            40 / mips
    t_m      Time to send a page                     0.1 ms
    t_b      Time to broadcast a page                (N − 1) · t_m

    Table 4.1: Parameters for the cost model

5. Cost of generating result tuples: |G_i| · t_w · Q
6. Cost of sending/receiving: (|G_i| / P) · t_m · Q
7. Cost of computing the final aggregates: |G_i| · (t_r + t_a · A) · Q
8. Cost of generating result tuples: |G_i| · S_g · t_w · Q
9. Cost of the extra writes and reads required for tuples not processed in the first pass: ((|G_i| · Q − M/S_g · T) / P) · 2 IO
10. Cost of storing to local disk: (|G_i| · S_g / P) · IO · Q

The cost model of 2P for single queries is the special case in which Q = T = 1.

4.3.2 Repartitioning Algorithm (mrep)

The mrep algorithm processes multiple queries simultaneously, using Rep for each query. It seems well suited to multiple queries, since multiple queries require a lot of main memory and the base algorithm (Rep) uses main memory efficiently. However, we need to make the partitioning independent for each query, because each query may have a different group-by clause. Therefore, to process Q queries simultaneously, each node reads a tuple, computes the groups that the tuple belongs to for all Q queries, and then sends the tuple to Q (possibly different) nodes. The communication cost of the mrep algorithm for Q queries is thus Q times larger than that of the Rep algorithm for a single query. The cost model of this algorithm is as follows:

1. Scanning cost: (|R_i| / P) · IO · T
2. Selection cost: |R_i| · t_r · T
3. Cost of hashing to find the destination and writing to the communication buffer: |R_i| · (t_h + t_w) · Q
4. Repartitioning send/receive cost: (|R_p| / P) · p · t_m · Q
5. Aggregation cost: |R_p| · (t_r + t_a · A) · Q

6. Cost of the extra writes and reads required for tuples not processed in the first pass: ((|R_p| · p · Q − M/S · T) / P) · 2 IO
7. Cost of generating result tuples: |R_p| · S · t_w · Q
8. Cost of storing to local disk: (|R_p| · S / P) · p · IO · Q

4.3.3 Broadcasting Algorithm (mbc)

The mbc algorithm simply processes Q queries simultaneously, using BC for each query. As we will see in the next section, for a single query the BC algorithm is slower than the others, since its network cost is very high. However, the communication cost of the mbc algorithm is the same for multiple queries as for a single query, while the other algorithms have communication costs proportional to Q. Therefore, the mbc algorithm may outperform the other algorithms when there are multiple queries.

Broadcasting is usually more expensive than simple point-to-point message passing. When the network does not support broadcasting efficiently and allows only one point-to-point communication at a time, as in the case of ordinary Ethernets, it is necessary to use N · (N − 1) serialized point-to-point communications for an all-to-all broadcast, in which each node sends its data to all the other nodes (in the MPI standard, this type of communication is referred to as all-gather). Therefore, letting t_m be the point-to-point communication cost, we assume that the broadcasting cost t_b is N · (N − 1) · t_m. Some multiprocessor interconnects, such as ATM switches and the HPS for the IBM SP2, allow multiple pairs of processors to communicate simultaneously. When this kind of network is available, an all-to-all broadcast can be performed in only N − 1 stages of point-to-point communication; thus, the all-to-all broadcast cost t_b is (N − 1) · t_m. Since our test bench, an IBM SP2, has this kind of network, we use this model of the broadcasting cost. The following is the cost model of the broadcasting algorithm:

1. Scanning cost: (|R_i| / P) · IO · T
2. Broadcasting cost: (|R_i| / P) · t_b · T

103 4.4. ANALYTICAL EVALUATION Cost of getting tuples from the communication buffer: R t r T 4. Hashing cost: R t h Q 5. Aggregation cost: R p t a A Q 6. Cost of extra write/read required for the tuples not processed in the first pass: (R p p Q M/S T )/P 2 IO 7. Cost of generating result tuples: R p S t w Q 8. Cost of storing to local disk: R p S/P p IO Q Number of Scans All the algorithms have trade-offs between the scan cost and the cost of extra IOs. As we can see from the cost models, we need to know the grouping selectivities (S) of queries in order to optimize the number of scans. The grouping selectivities are also necessary for deciding a good subset of a data cube to be pre-computed (using the algorithm presented in [HRU96]). Statistical procedures such as [HNSS95] can be used to estimate the grouping selectivities. Thus we assume in the following sections that we have an estimate of the grouping selectivity for each query. 4.4 Analytical Evaluation In this section, we evaluate the performance of the algorithms from various perspectives Grouping Selectivity As explained in [SN95], m2p and mrep are expected to be sensitive to grouping selectivity. Figure 4.3 shows the relationship between the number of groups per query, G (proportional to the grouping selectivity) and the response times of these algorithms for a configuration of 16 processors with high-speed interconnects. The number of queries (Q) is 100, in this case. Surprisingly, mbc outperforms both m2p and mrep when the number of

groups falls in an intermediate range starting at about 5 × 10^3. For larger numbers of groups, mrep wins: in that range of selectivity, mrep iteratively scans the source relation to avoid extra I/Os, because the iterative scans are cheaper than the extra I/Os. When the grouping selectivity becomes higher still, however, mbc wins again. The reason is as follows: the high grouping selectivity would require so many scans to avoid extra I/Os that the cost of the iterative scans exceeds the cost of the extra I/Os. Therefore, both mrep and mbc scan the source relation only once and use the necessary extra I/Os. The extra-I/O costs of mrep and mbc are the same, but the communication cost of mrep for a single scan is p · Q / (N − 1) times larger than that of mbc, which hurts the performance of mrep.

When the number of groups is small, m2p is a little faster than mbc. m2p has a shoulder (a point at which the response time starts to increase), indicating that the local result overflows the main memory of a single node (at about |G| = M/Q ≈ 2,000 groups). mbc also has a shoulder, which shows that the overall result has become larger than the total main memory of the multiprocessor (at about |G| = N · M/Q ≈ 32,000 groups). The switching point between m2p and mbc lies within the range of typical numbers of groups ([400, 160,000]) for the data-mining applications explained in Example 4.2.

Figure 4.4 shows a case in which there is only one query. As expected, m2p works well when the selectivity is low, while mrep beats the other algorithms when the selectivity is high. mbc does not win for any selectivity.

Figure 4.5 compares the best performance of the proposed algorithms (m2p, mrep, and mbc) with that of the conventional algorithms (2P, Rep, and BC) when there are 100 queries. We can see that the best of the proposed algorithms is about 4 times faster than the best of the conventional algorithms over a wide range of grouping selectivities.

4.4.2 Number of Queries

To determine how many queries are necessary for mbc to outperform the other algorithms, we examined their performance with the number of groups fixed. Figure 4.6 shows the relationship between the number of queries and the

Figure 4.3: Selectivity vs. performance (Q = 100)

Figure 4.4: Selectivity vs. performance (Q = 1)

Figure 4.5: Proposed vs. conventional

response times when the number of groups per query (|G|) is 50,000. We can see that mbc outperforms the other algorithms when there are 25 or more queries.

4.4.3 Network Speed

The Broadcasting algorithm has been considered impractical in view of the slowness of conventional networks. Figure 4.7 shows how the performance of the three algorithms depends on the network speed when N = 16, Q = 100, and |G| = 50,000. mbc can outperform the other algorithms when the network speed is more than 5 MB/s. The speed of ordinary Ethernets is so low (0.8 MB/s) that mrep and mbc do not work well.

4.4.4 Speedup and Scaleup

Figure 4.8 shows the speedup and scaleup characteristics of the algorithms in a case where the number of queries is 100 and the number of groups per query is 50,000. The relative speedup is the response time normalized with

Figure 4.6: No. of queries vs. performance

Figure 4.7: Network speed vs. performance

that of a hash-based serial algorithm on a single node, with the problem size kept constant. The relative scaleup is likewise the normalized response time when the number of tuples in the source relation is proportional to the number of nodes.

The speedup and scaleup of mbc are better than linear when the number of processors is less than 150. This is because the serial algorithm must employ a lot of extra I/Os to process data for groups that cannot fit into main memory, while mbc can reduce these I/Os by making effective use of the entire main memory of the system. The scaleup of mbc has a peak when the number of nodes is 24. At this point, the main memory size of the entire system is almost the same as the total size of the results, so the whole memory is fully utilized. As the number of processors increases beyond 24, the amount of unused memory increases, and the broadcasting cost, which is proportional to the number of processors, overshadows the benefit gained. The speedup and scaleup of mrep are superlinear and stable, because mrep uses main memory efficiently and its communication cost is independent of the number of processors. m2p has a linear scaleup. The speedup of m2p decreases as the number of processors increases, because the partial result size (|G_i|) remains constant, and hence the computation cost of the second phase limits the speedup. Note that the speedup and scaleup performance are very sensitive to the problem size (Q and |G|). When the problem is small, m2p scales well, while mrep and mbc do not.

4.4.5 Switching Points

Figure 4.9 shows how the best algorithm for a 16-node system depends on the problem size, that is, the number of queries (Q) and the number of groups per query (|G|). The switching points of m2p and mbc lie on a line parallel to the dotted line labeled M = Q · |G|, on which the entire result size is equal to the main memory size of a single node.

From this analytical evaluation, we can conclude that none of the algorithms gives a satisfactory performance over the entire range of problem sizes,

Figure 4.8: Speedup and scaleup

Figure 4.9: Switching points

and that it is necessary to choose whichever of them is best according to the overall result size and the available memory size. The mbc algorithm for multiple queries may be practical when high-speed networks are available and the result size is larger than the main memory size of a single node.

4.5 Empirical Evaluation

The cost models that we used in Section 4.4 are very simple; they do not include several factors, such as network contention and overlap of I/O and CPU, which may strongly affect the performance. It was therefore necessary to validate our analytical conclusions by conducting experiments on a real system. We implemented m2p, mrep, and mbc in Database SONAR [FMMT96d] using standard MPI communication primitives [MPI94]. Conformity with the standard makes the implementations portable to any parallel architecture, including workstation clusters. All experiments were performed on a 16-node IBM SP2.

Each node in the system has a 66-MHz POWER2 processor, 256 MB of real memory, a 2-MB L2 cache, and a SCSI-2W 4-GB HDD. The processors all run under the AIX 4.1 operating system and communicate with each other through the High-Performance Switch [SSA+95] with HPS adaptors. We assigned 50 MB of main memory for hash tables on each node. We randomly generated 10 million 500-byte tuples of test data with 130 attributes, and divided the data evenly among all the nodes of the system; thus each node held about 298 MB of the relation. Since the prototype system is not a real database system and has no concurrency control, the implementations are more CPU-efficient than complete database systems. The response times shown below are averages over several runs.

4.5.1 Grouping Selectivity

Figure 4.10 shows the relationship between the number of groups and the response times of the implementations for 100 aggregate queries. We can see almost the same characteristics as in Figure 4.3. The mbc algorithm actually outperforms m2p and mrep when the number of groups per query falls within the range [2 × 10^3, 10^5]. The measured response times are comparable with the response times predicted by the analytical models.

Figure 4.11 shows that mbc does not win for any number of groups per query when there is only one query. m2p and mrep work well when the grouping selectivity is low and high, respectively; mbc does not work well for single aggregate queries.

Figure 4.12 compares the best performance of the proposed algorithms (m2p, mrep, and mbc) with that of the conventional algorithms (2P, Rep, and BC) when there are 100 queries. We can see that the best of the proposed algorithms is about 16 times faster than the best of the conventional algorithms over a wide range of grouping selectivities.

4.5.2 Switching Points

Figure 4.13 shows how the best algorithm for a 16-node system depends on the number of queries and the number of groups per query. We can see almost the same characteristics as in Figure 4.9. The switching points of m2p and

Figure 4.10: Selectivity vs. performance (Q = 100)

Figure 4.11: Selectivity vs. performance (Q = 1)

Figure 4.12: Proposed vs. conventional

mbc lie on a line parallel to the broken line labeled M = Q · |G|. Using this result, we can choose the best algorithm according to the estimated number of groups and the available memory size.

4.5.3 Speedup and Scaleup

Since we have only a 16-node system, we could not measure the performance of configurations with more than 16 nodes. Figure 4.14 shows the speedup and scaleup performance when there are 100 queries and the number of groups per query is 50,000. Observe that the speedup and scaleup of mrep and mbc are better than linear, and that those of m2p are almost linear, as explained in Section 4.4. The measured speedup and scaleup of mbc decline a little earlier than the analytical cost model predicts. The reason is as follows. We used an MPI broadcast primitive in the implementation, which uses log N stages of communication. Each node of the system broadcasts, and only one broadcast can be performed at a time. Thus the necessary number of communication stages is N · log N, which is more than the cost model estimates. As mentioned in Section 4.4, the

Figure 4.13: Switching points

cost can be reduced to N − 1 stages by using point-to-point communication primitives. It is therefore possible to build an implementation that gives better performance.

4.6 Conclusions

We have presented three parallel algorithms for multiple aggregate queries, Two-Phase (m2p), Repartitioning (mrep), and Broadcasting (mbc), and compared them by using analytical cost models and by conducting experiments on a real system. Although the Broadcasting algorithm has been considered impractical for today's shared-nothing parallel architectures, we showed that it works well for multiple aggregate queries if high-speed networks are available and if the total result is larger than the main memory of a single node. Recently, high-performance networks have become a kind of commodity. It is thus now practical to use the Broadcasting algorithm for processing multiple aggregate queries on shared-nothing multiprocessors.

Figure 4.14: Speedup and scaleup

The experimental results have the same characteristics as those predicted by the analytical models. The models are so simple that the measured response times cannot be expected to match the predicted times exactly; adjusting the parameters of the cost models will reduce the deviations between them. Since none of the algorithms gives a satisfactory performance over the entire range of problem sizes, it is necessary to choose whichever of them is best according to the overall result size and the available memory size. Switching points for the algorithms can be decided by using the analytical cost models.

In this study, we have ignored the relationships among multiple queries and treated the queries as independent. But actual queries in DSS applications are often interrelated, and thus several optimization techniques can be applied, as explained in [AAD+96] for serial processors.
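As an illustration of how such switching points could be determined programmatically, the following Python sketch compares highly simplified response-time estimates for the three algorithms and picks the cheapest one. The formulas are deliberately reduced to the dominant scan, communication, and extra-I/O terms, and the parameter values are assumptions in the spirit of Table 4.1; this is not the full cost model of Section 4.3.

```python
def estimate_costs(n_nodes, n_queries, groups_per_query,
                   tuples=10_000_000, tuple_bytes=500, page_bytes=4096,
                   io_ms=0.8, msg_ms=0.1, mem_bytes=50 * 2**20,
                   result_tuple_bytes=250):
    """Very rough response-time estimates (in ms) for m2p, mrep, and mbc."""
    pages_per_node = tuples * tuple_bytes / n_nodes / page_bytes
    scan = pages_per_node * io_ms
    result_bytes = n_queries * groups_per_query * result_tuple_bytes

    def spill(local_result_bytes):
        # extra write + read for result data that does not fit in the hash memory
        excess = max(0.0, local_result_bytes - mem_bytes)
        return 2 * (excess / page_bytes) * io_ms

    # m2p: every node may hold the whole result locally, then partial results are merged
    m2p = scan + spill(result_bytes) + (result_bytes / page_bytes) * msg_ms
    # mrep: every local page is repartitioned once per query; result is spread over nodes
    mrep = scan + n_queries * pages_per_node * msg_ms + spill(result_bytes / n_nodes)
    # mbc: local pages are broadcast once in N - 1 stages; result is spread over nodes
    mbc = scan + (n_nodes - 1) * pages_per_node * msg_ms + spill(result_bytes / n_nodes)
    return {"m2p": m2p, "mrep": mrep, "mbc": mbc}

costs = estimate_costs(n_nodes=16, n_queries=100, groups_per_query=50_000)
print(min(costs, key=costs.get), costs)
```

With these assumed parameters the sketch favors mbc at Q = 100 and |G| = 50,000, which is consistent with the switching-point behavior reported above; refining the terms toward the full models of Section 4.3 would sharpen the predicted boundaries.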

Chapter 5

Decision Tree

5.1 Introduction

5.1.1 Decision Trees

Constructing an efficient decision tree is a very important problem in data mining [AGI+92, AIS93b, BFOS84, MAR96, Qui93]. For example, an efficient computer-based diagnostic medical system can be constructed if a small decision tree can be automatically generated for each medical problem from a database of health-check records for a large number of patients.

Let us consider the attributes of tuples in a database. An attribute is called Boolean if its domain is {0, 1}, categorical if its domain is a discrete set {1, ..., k} for some natural number k, and numeric if its domain is the set of real numbers. Each data tuple t has m + 1 attributes A_i, for i = 0, 1, ..., m. We treat one Boolean attribute (say, A_0) as special, denote it by W, and call it the objective attribute. The other attributes are called conditional attributes.

The decision tree problem is as follows. A set U of tuples is called positive (resp. negative) if the probability that t[W] is 1 (resp. 0) for a tuple t in U is at least θ_1 (resp. θ_2), for given thresholds θ_1 and θ_2. We would like to classify the set of tuples into positive subsets and negative subsets by using tests on the conditional attributes. For a Boolean (conditional) attribute, a test is of the form t[A_i] = 1?. For a categorical attribute, a traditional

Table 5.1: Health check records (columns: Patient, BP, C, S, Diseased)

test is t[A_i] = l?. For a numeric attribute, a traditional test is t[A_i] < Z? for a given value Z.

Let us consider a rooted binary tree, each of whose internal nodes is associated with a test on the attributes. We associate each leaf node with the subset (called a leaf cluster) of tuples satisfying all tests on the path from the root to the leaf. Every leaf cluster is labeled as either positive or negative on the basis of the class distribution in the leaf cluster. Such a tree-based classifier is called a decision tree.

For example, assume that we have a database of health-check records, shown in Table 5.1, for a large number of patients with geriatric diseases. Consider a set of health-check items, say, systolic blood pressure (BP), cholesterol level (C), and urine sugar (S). We would like to decide whether a patient needs a detailed health check for a geriatric disease (say, apoplexy). Suppose that blood pressure is a numeric attribute, and that urine sugar and cholesterol level are Boolean (+ or −) attributes in the health-check database. Figure 5.1 shows an example of a decision tree that decides whether a patient is diseased or not from these health-check items.

We want to construct a compact decision tree. Unfortunately, if we want to minimize the total sum of the lengths of exterior paths, the problem of

Figure 5.1: Decision tree

constructing a minimum decision tree that completely classifies a given set of data is known to be NP-hard [HR76, GJ79]. It is also believed to be NP-hard if the minimized objective is the size, that is, the number of nodes in the tree. However, in practical applications, classification accuracy for unseen data is more important than the complete classification of the given data. Therefore, despite the NP-hardness of the problem, many practical solutions (e.g., [BFOS84, Qui86b, QR89, Qui93]) have been proposed in the literature. Among them, the C4.5 program [Qui93] applies an entropy heuristic, which greedily constructs a decision tree in a top-down, breadth-first manner according to the entropy of splitting. At each internal node, the heuristic examines all the candidate tests, and chooses the one for which the associated splitting of the set of tuples attains the minimum entropy value. If each test attribute is Boolean or categorical, these practical approaches work well, and SLIQ [MAR96] gives an efficient, scalable implementation, which can handle a database with 10 million tuples and 400 attributes. SLIQ uses the gini index function instead of entropy.

5.1.2 Handling Numeric Attributes

To handle a numeric attribute, one approach is to make it categorical by subdividing the range of the attribute into smaller intervals. Another approach is to consider a test of the form t[A_i] > Z or t[A_i] < Z, which is called a guillotine cutting, since it creates a guillotine-cut subdivision of the Cartesian space of ranges of attributes. C4.5 and SLIQ adopt the latter

Figure 5.2: Healthy region and guillotine-cut subdivision

approach. However, [Qui93] pointed out that this approach has a serious problem if a pair of attributes are correlated. For example, let us consider two numeric attributes, height (cm) and weight (kg), in a health-check database. Obviously, these attributes have a strong correlation. Indeed, the region c_1 · height^2 < weight < c_2 · height^2, for suitable constants c_1 < c_2, and its complement provide a popular criterion for separating healthy patients from patients who need dietary cures. In the left chart of Figure 5.2, the gray region shows the healthy region. However, if we construct a decision tree for classifying patients by using guillotine cutting, its subdivision is complicated and the size of the tree becomes very large, and thus it becomes hard to recognize the substantial rule (see the right chart of Figure 5.2). Therefore, in order to build an efficient diagnostic system based on decision trees, it is very important to propose a better scheme for handling numeric attributes with strong correlations.

One approach is as follows: consider each pair of numeric attributes as a two-dimensional attribute. Then, for each such two-dimensional attribute, compute a line partition of the corresponding two-dimensional space so that the corresponding entropy is minimized. One (minor) defect of this method is that it is not cheap to compute the optimal line; although some work has been done on this problem in computational geometry [AT94, DE93], the worst-case time complexity remains O(n^2) if there are n tuples. Another

(major) defect is that the decision tree may still be too large even if we use line partitions. Although some multivariate decision tree classifiers [BFOS84, BU95] can find multivariate tests in more practical ways, the latter problem, which is inherent in linear partitioning methods, still remains.

5.1.3 Main Results

In this chapter, we propose the following scheme, applying the two-dimensional association rules given in Chapter 3, which we call region rules for short. The scheme has been implemented as a subsystem of SONAR, which stands for System for Optimized Numeric Association Rules [FMMT96d].

Let n be the number of tuples in the database. First, for each numeric attribute, we create an equi-depth bucketing so that the tuples are uniformly distributed into N ≤ √n ordered buckets according to the values of the attribute. Next, we find all pairs of strongly correlated numeric attributes. For each such pair A and B, we create an N × N pixel grid G according to the Cartesian product of the bucketings of the two attributes. We consider a family R of grid regions; in particular, we consider the set R(Rect) of all rectangular regions, R(Xmono) of all x-monotone regions, and R(Convex) of all rectilinear convex regions, all of which are defined in Chapter 3. Figure 3.1 shows examples of rectangular, x-monotone, and rectilinear convex regions. Regarding the pair of attributes as a two-dimensional attribute, we compute the region R_opt in R that minimizes the entropy function, and consider the decision rule (t[A], t[B]) ∈ R_opt.

We present an algorithm for computing R_opt in O(nN^2) worst-case time for R(Xmono). Moreover, in practical instances, our algorithm runs in O(N^2 log n) time; since N ≤ √n, this time complexity is O(n log n). For R(Rect) and R(Convex), the time complexity increases to O(nN^3) in the worst case and O(N^3 log n) in practice. Now, we add these rules for all pairs (A, B) of correlated attributes, and construct a decision tree by applying entropy-based heuristics. As a special case of region rules, we also consider rules of the form (t[A] ∈ I) for an interval I in developing our system.

Since the regions separated by guillotine cutting and those separated by

line cutting are very special cases of x-monotone regions, our method can find a region rule with a smaller entropy value at each step of the heuristic. Hence, we can almost always create a smaller tree. In the above example of height and weight, the rule c_1 · height^2 < weight < c_2 · height^2 itself defines an x-monotone region, and hence we can create a nice decision tree of height two, that is, with a root and two leaves.

One defect of our approach is that the decision rule (t[A], t[B]) ∈ R is sometimes hard to describe. However, we can describe the rule by using a visualization system. Figure 5.3 shows a graphical view of an x-monotone region rule in a decision tree that was constructed from a diabetes diagnosis dataset in the UCI repository [MA94]. The visualization system uses the red color level and the brightness to show the characteristics of each pixel: the red level indicates the probability that a patient in the pixel is positive or negative, and the brightness indicates the number of patients in the pixel. The data in the node are partitioned according to whether they are in the x-monotone region R_opt or not. In this example, the near-triangular region R_opt corresponds to the cluster of patients less likely to be positive for diabetes.

5.2 Entropy-Based Data Splitting

5.2.1 Entropy of a Splitting

Assume that a data set S contains n tuples. To formalize our definition of the entropy of a splitting, we consider a more general case in which the objective attribute W is a categorical attribute taking values in {1, 2, ..., k}.

Definition 5.1 The entropy value Ent(S) (with respect to the objective attribute W) is defined as

    Ent(S) = − Σ_{j=1}^{k} p_j log p_j,

where p_j is the relative frequency with which W takes the value j in the set S.

We now consider the entropy function associated with a splitting of the data.

Figure 5.3: x-monotone region splitting

Example 5.1 Suppose that the objective attribute has three categories, say C_1, C_2, and C_3, and that the categories contain 40, 30, and 30 tuples, respectively:

    C_1  C_2  C_3
     40   30   30

The entropy of the whole data set is

    −(0.4 log 0.4 + 0.3 log 0.3 + 0.3 log 0.3) = 1.09.

Let us consider a splitting of the data set into two subsets, S_1 and S_2, with n_1 and n_2 tuples, respectively, where n_1 + n_2 = n. The entropy of the splitting is defined by

    Ent(S_1; S_2) = (n_1/n) Ent(S_1) + (n_2/n) Ent(S_2).

For one such splitting of the 100 tuples into S_1 and S_2, the entropy of the data set after the splitting is 0.80, so the splitting decreases the entropy by 0.29. For another splitting, the associated entropy is 1.075, a decrease of only 0.015.

Definition 5.2 Let f(x) = f(x_1, ..., x_k) = Σ_{i=1}^{k} x_i log(x_i / s(x)), where s(x) = Σ_{i=1}^{k} x_i. We have

    Ent(S) = −f(p_1, ..., p_k) = −(1/n) f(x_1, ..., x_k),

where n = |S| and x_i = p_i · n. Thus,

    Ent(S_1; S_2) = −(1/n) { f(y_1, ..., y_k) + f(x_1 − y_1, ..., x_k − y_k) },

where x_i (resp. y_i) is the number of tuples t in S (resp. S_1) satisfying t[W] = i.
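To make the definitions concrete, the following short Python sketch (using the natural logarithm, consistent with the value 1.09 above) computes Ent(S) from class counts and the entropy of a splitting from the counts of S and S_1. It is merely an illustration of Definition 5.1 and Example 5.1, not code from SONAR.

```python
import math

def ent(counts):
    """Ent(S) for a list of per-class tuple counts."""
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def split_entropy(x, y):
    """Ent(S1; S2), where x are the class counts of S and y those of S1 (S2 has x - y)."""
    n = sum(x)
    n1 = sum(y)
    n2 = n - n1
    s2 = [xi - yi for xi, yi in zip(x, y)]
    return (n1 / n) * ent(y) + ((n2 / n) * ent(s2) if n2 > 0 else 0.0)

print(round(ent([40, 30, 30]), 2))   # 1.09, as in Example 5.1
```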

We use the following property of f:

Lemma 5.1 The function f(x) is convex in the region x ≥ 0 (i.e., x_i ≥ 0 for i = 1, 2, ..., k); that is, for any vectors X_1, X_2 ≥ 0,

    f(X_1) + f(X_2) ≥ 2 f((X_1 + X_2)/2).

Proof For a vector Δ = (δ_1, ..., δ_k), let Y(Δ) = (Δ, x) = Σ_{i=1}^{k} δ_i x_i. In order to show the convexity of f(x), it suffices to show that f''(X) ≥ 0 for every Δ, where

    f'(X) = ∂f(x)/∂Y(Δ)  and  f''(X) = ∂^2 f(x)/∂Y(Δ)^2.

Since

    f'(X) = Σ_{i=1}^{k} δ_i log x_i − s(Δ) log s(x),

we have

    f''(X) = Σ_{i=1}^{k} δ_i^2 x_i^{−1} − s(Δ)^2 s(x)^{−1}.    (5.1)

Now, it suffices to show that

    (5.1) ≥ 0, which is equivalent to s(x) Σ_i δ_i^2 x_i^{−1} − s(Δ)^2 ≥ 0.

We consider the vectors A = (x_1^{1/2}, ..., x_k^{1/2}) and B = (x_1^{−1/2} δ_1, ..., x_k^{−1/2} δ_k). Then, from the Cauchy-Schwarz inequality,

    s(x) Σ_i δ_i^2 x_i^{−1} − s(Δ)^2 = |A|^2 |B|^2 − (A, B)^2 ≥ 0.

5.2.2 Region Splitting

Given a numeric attribute A, guillotine cutting methods consider the following optimized splitting:

Definition 5.3 Let S(A > Z) = {t ∈ S | t[A] > Z} and S(A ≤ Z) = {t ∈ S | t[A] ≤ Z} for a real number Z. Let Z_opt denote the value that minimizes Ent(S(A > Z); S(A ≤ Z)). Then we call the splitting of S into S(A > Z_opt) and S(A ≤ Z_opt) the guillotine-cut splitting of S.

By applying the algorithm described in Section 2.4.3, we can extend the above splitting to the following, which is also considered in our decision tree subsystem of SONAR:

Definition 5.4 For an interval I, let S(A ∈ I) = {t ∈ S | t[A] ∈ I} and S(A ∈ Ī) = {t ∈ S | t[A] ∉ I}. Let I_opt denote the interval that minimizes Ent(S(A ∈ I); S(A ∈ Ī)). We call the associated splitting the range splitting of S.

We call the above two kinds of splitting one-dimensional rules for short. In this chapter, we consider region splittings.

Definition 5.5 Let A_1, ..., A_N denote buckets of the domain of A, and B_1, ..., B_N denote buckets of the domain of B (the definition of buckets is given in Definition 2.6). Consider a two-dimensional N × N pixel grid G consisting of N × N unit squares called pixels, which is generated as the Cartesian product of the bucketings. G(i, j) is the (i, j)-th pixel, where i and j are called the row number and column number, respectively. We denote the pixel containing a tuple t by q(t). We specify a number N ≤ √n, and construct an almost equi-depth ordered bucketing of tuples for each numeric attribute.

We consider a family R of grid regions of G.

Definition 5.6 For each R ∈ R, we consider the splitting of S into S(R) = {t ∈ S | q(t) ∈ R} and S(R̄) = {t ∈ S | q(t) ∈ R̄}, where R̄ = G \ R is the complement of R. Let R_opt be the region of R that minimizes the entropy of the splitting. The region R_opt and the associated splitting are called the optimal region and the optimal splitting, or region rule, with respect to R and the pair of attributes (A, B).
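As a concrete sketch of the guillotine-cut splitting of Definition 5.3 for a Boolean objective attribute (our own illustration; the data, the function names, and the tie-handling details are not from the thesis), the following fragment scans the sorted values of A and returns a threshold Z_opt minimizing Ent(S(A > Z); S(A ≤ Z)).

#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>
#include <utility>
#include <vector>

// Boolean entropy of a set with `pos` positive tuples out of `n`.
double ent(double pos, double n) {
    if (n == 0.0) return 0.0;
    double p = pos / n, q = 1.0 - p, e = 0.0;
    if (p > 0.0) e -= p * std::log(p);
    if (q > 0.0) e -= q * std::log(q);
    return e;
}

// Guillotine-cut splitting (Definition 5.3): choose the threshold Z that
// minimizes Ent(S(A > Z); S(A <= Z)).  `data` holds (t[A], t[W]) pairs.
double best_guillotine_cut(std::vector<std::pair<double, bool>> data) {
    std::sort(data.begin(), data.end());
    const double n = static_cast<double>(data.size());
    double total_pos = 0.0;
    for (const auto& d : data) total_pos += d.second ? 1.0 : 0.0;

    double best_ent = std::numeric_limits<double>::infinity();
    double best_z = data.front().first;
    double left_n = 0.0, left_pos = 0.0;          // tuples with t[A] <= Z
    for (std::size_t i = 0; i + 1 < data.size(); ++i) {
        left_n += 1.0;
        left_pos += data[i].second ? 1.0 : 0.0;
        if (data[i].first == data[i + 1].first) continue;  // Z must separate values
        double e = (left_n / n) * ent(left_pos, left_n)
                 + ((n - left_n) / n) * ent(total_pos - left_pos, n - left_n);
        if (e < best_ent) { best_ent = e; best_z = data[i].first; }
    }
    return best_z;   // Z_opt; the splitting is S(A <= Z_opt) versus S(A > Z_opt)
}

int main() {
    // Hypothetical tuples: numeric attribute A with Boolean objective W.
    std::vector<std::pair<double, bool>> data = {
        {1.0, false}, {2.0, false}, {3.0, false}, {4.0, true},
        {5.0, true},  {6.0, true},  {7.0, false}, {8.0, true}};
    std::cout << "Z_opt = " << best_guillotine_cut(data) << std::endl;
    return 0;
}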

Definition 5.7 A pixel region is x-monotone if its intersection with each column is undivided (and thus a vertical range). A pixel region is rectilinear convex if both its intersection with each column and its intersection with each row are undivided. R(Xmono) and R(Convex) are the sets of all x-monotone and rectilinear convex regions of G, respectively.

In Section 5.3, we present efficient algorithms for computing the optimal splitting with respect to three families of regions, R(Rect), R(Xmono), and R(Convex), when the objective attribute W is Boolean.

The construction of a decision tree is top-down, starting from its root in breadth-first fashion. When a new internal node is created, the algorithm first computes all one-dimensional rules for single numeric attributes, and region rules for all pairs of numeric attributes, together with rules associated with Boolean or categorical conditional attributes. Then it chooses the rule that minimizes the entropy. The decision made at the node is associated with this splitting.

Selecting Correlated Attributes

Even if A and B are not strongly correlated, the region rule associated with the pair (A, B) is better with respect to the entropy value than the one-dimensional rules on A and B. However, it does not necessarily give a better system for users, since a region rule is more complicated than a one-dimensional rule. Indeed, some techniques, like the visualization system described in Section 3.5, are necessary to explain a region rule. Hence, it is desirable that a region rule should only be considered for a pair of strongly correlated conditional attributes.

We use the entropy value again to decide whether A and B are strongly correlated. For simplicity, we assume that R(Xmono) is used as the family of regions. We compute R_opt for the pair (A, B) and its entropy value Ent(S(R_opt); S(R̄_opt)). We also compute the optimum intervals I and J that minimize the entropy of the splittings corresponding to the rules A(X) ∈ I and B(X) ∈ J, respectively. We give a threshold α ≥ 1 to decide that A and B are strongly correlated

if and only if

Ent(S) - Ent(S(R_opt); S(R̄_opt)) > α ( Ent(S) - min{ Ent(S(I); S(Ī)), Ent(S(J); S(J̄)) } ).

The choice of the threshold α depends on the application.

5.3 Algorithms

Naive Hand-Probing Algorithm

From now on, we concentrate on the case in which the objective attribute W is Boolean, although our scheme can be extended to the case in which W is categorical. Therefore, the entropy function is written as

Ent(S) = -p \log p - (1 - p) \log(1 - p),

where p is the relative frequency with which the objective attribute takes one of the two Boolean values in the set of tuples.

We consider the problem of computing R_opt in several families of grid regions of G. Note that it is very expensive to compute R_opt by examining all elements of R, since the set R(Xmono), for example, contains at least N^N different regions.

Let n_1 and n_2 be the numbers of tuples t of S satisfying t[W] = 0 and t[W] = 1, respectively. For a region R, let x(R) and y(R) be the numbers of tuples t located in the pixels of R that satisfy t[W] = 0 and t[W] = 1, respectively.

Definition 5.8 Given a family of regions R, let P be the set of planar points {ι(R) = (x(R), y(R)) | R ∈ R}. We denote its convex hull by conv(P). Since x(R) and y(R) are nonnegative integers that are at most n, P contains O(n^2) points, and conv(P) has at most 2n points on it. We define

E(x, y) = -\frac{f(x, y) + f(n_1 - x, n_2 - y)}{n},

using the function f defined in Definition 5.2 for X = (x, y). Then, the entropy function Ent(S(R); S(R̄)) of the splitting is E(ι(R)) = E(x(R), y(R)).
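For the Boolean case, the function E of Definition 5.8 is straightforward to evaluate. The following sketch is ours (the counts in main are hypothetical); it computes E(x, y) for the stamp point (x(R), y(R)) of a region and therefore the entropy Ent(S(R); S(R̄)) of the corresponding splitting.

#include <cmath>
#include <iostream>

// Two-argument form of f from Definition 5.2:
// f(x, y) = x log(x / (x + y)) + y log(y / (x + y)).
double f2(double x, double y) {
    double s = x + y, v = 0.0;
    if (x > 0.0) v += x * std::log(x / s);
    if (y > 0.0) v += y * std::log(y / s);
    return v;
}

// E(x, y) = -( f(x, y) + f(n1 - x, n2 - y) ) / n, the entropy of the
// splitting whose stamp point is (x, y) = (x(R), y(R))  (Definition 5.8).
double E(double x, double y, double n1, double n2) {
    return -(f2(x, y) + f2(n1 - x, n2 - y)) / (n1 + n2);
}

int main() {
    // n1 tuples with t[W] = 0 and n2 with t[W] = 1 (hypothetical counts);
    // a region R containing x of the former and y of the latter.
    double n1 = 600, n2 = 400, x = 100, y = 300;
    std::cout << E(x, y, n1, n2) << std::endl;   // Ent(S(R); S(R-bar)), about 0.495
    return 0;
}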

Lemma 5.2 ι(R_opt) must be on conv(P).

Proof From Lemma 5.1, f(x, y) is convex, and hence E(x, y) is a concave function. It is well known that the minimum of a concave function over P is taken at an extremal point, that is, a vertex of conv(P).

Hence, naively, it suffices to compute all the vertices of conv(P) and their associated partitions. Our problem now resembles global optimization problems [PR90]. In global optimization, extremal points can be computed by using linear programming. However, we know neither the point set P nor the constraint inequalities defining the convex hull; hence, we cannot use the linear programming approach in a straightforward manner.

Definition 5.9 Let conv^+(P) (resp. conv^-(P)) be the upper (resp. lower) chain of conv(P); here, we consider that the leftmost (resp. rightmost) vertex of conv(P) belongs to the upper (resp. lower) chain.

Our algorithm is based on the use of what is known in computational geometry as hand-probing to compute the vertices of a convex polygon [DEY86]. Hand-probing is based on the touching oracle: given a slope θ, compute the tangent line with slope θ to the upper (resp. lower) chain of the convex polygon together with the tangent point v^+(θ) (resp. v^-(θ)). If the slope coincides with the slope of an edge of the polygon, the left vertex of the edge is reported as the tangent point.

Lemma 5.3 If a touching oracle is given in O(T) time, all vertices of conv(P) can be computed in O(nT) time.

Proof We consider an interval I = [I(left), I(right)] of the upper chain of conv(P) between two vertices I(left) and I(right), as in Figure 5.4. We start with θ = ∞, find the leftmost vertex p_0 and the rightmost vertex p_1 of conv(P), and set I(left) = p_0 and I(right) = p_1. Let θ(I) be the slope of the line through the points I(left) and I(right). We perform the touching oracle and find I(mid) = v^+(θ(I)). If I(mid) = I(left), we report that I corresponds

to an edge of conv(P), and hence no other vertex exists there. Otherwise, we divide I into [I(left), I(mid)] and [I(mid), I(right)], and process each subinterval recursively. We find either a new vertex or a new edge by each execution of the touching oracle in the algorithm. Hence, the time complexity is O(mT), where m ≤ 2n is the number of vertices of conv(P).

Figure 5.4: Hand-probe (an interval I = [I(left), I(right)] of the upper chain, the probed tangent point I(mid), and the intersection point Q(I) of the two tangent lines)

Lemma 5.4 For a given θ, the touching oracle to conv(P) can be computed in O(N^2) time for R(Xmono), and in O(N^3) time for R(Convex) and R(Rect).

Proof It suffices to show how to compute v^+(θ), since v^-(θ) can be computed analogously. Let v^+(θ) = (x(R_θ), y(R_θ)), and let the tangent line be y - θx = a. Then, y(R_θ) - θ x(R_θ) = a and y(R) - θ x(R) ≤ a for any R ∈ R. Hence, R_θ is the region that maximizes y(R) - θ x(R). Let u_{i,j} be the number of tuples in the (i, j)-th pixel of G, and let v_{i,j} be the number of tuples satisfying t[W] = 1 in the (i, j)-th pixel. We write gain_{i,j}(θ) = v_{i,j} - θ u_{i,j}. From our definition,

y(R) - θ (x(R) + y(R)) = \sum_{(i,j) \in R} gain_{i,j}(θ).

If R = R(Xmono) or R = R(Convex), R_θ is the optimized-gain region defined in Definition 3.10, and it can be computed in O(N^2) and O(N^3) time, respectively, by using the algorithms given in Chapter 3.
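For the family R(Rect), the touching oracle reduces to a two-dimensional maximum-subarray computation over per-pixel weights. The sketch below is our own illustration of this O(N^3) computation, not the Chapter 3 algorithm: the function name, the per-pixel weight v_{i,j} - θ (u_{i,j} - v_{i,j}) used to express y(R) - θ x(R) pixel by pixel, and the sample grid are our assumptions. It returns the maximum of y(R) - θ x(R), i.e., the intercept a of the tangent line; recovering the rectangle and its stamp point is a matter of extra bookkeeping.

#include <algorithm>
#include <iostream>
#include <vector>

// Touching oracle for the family R(Rect), as used in Lemma 5.4: given a slope
// theta, find the axis-parallel rectangle R maximizing y(R) - theta * x(R).
// u[i][j] = tuples in pixel (i, j); v[i][j] = tuples with t[W] = 1 there.
// Per-pixel weight: (tuples with W = 1) - theta * (tuples with W = 0).
double rect_touching_oracle(const std::vector<std::vector<double>>& u,
                            const std::vector<std::vector<double>>& v,
                            double theta) {
    const int N = static_cast<int>(u.size());
    double best = -1e300;
    for (int top = 0; top < N; ++top) {
        std::vector<double> col(N, 0.0);          // weights summed over rows [top, bottom]
        for (int bottom = top; bottom < N; ++bottom) {
            for (int j = 0; j < N; ++j)
                col[j] += v[bottom][j] - theta * (u[bottom][j] - v[bottom][j]);
            // One-dimensional maximum-sum (non-empty) interval over the columns.
            double run = col[0], best_run = col[0];
            for (int j = 1; j < N; ++j) {
                run = std::max(col[j], run + col[j]);
                best_run = std::max(best_run, run);
            }
            best = std::max(best, best_run);
        }
    }
    return best;   // maximum of y(R) - theta * x(R) over all rectangles
}

int main() {
    // A tiny hypothetical 3x3 grid.
    std::vector<std::vector<double>> u = {{5, 5, 5}, {5, 5, 5}, {5, 5, 5}};
    std::vector<std::vector<double>> v = {{4, 1, 0}, {4, 2, 0}, {1, 1, 1}};
    std::cout << rect_touching_oracle(u, v, 1.0) << std::endl;
    return 0;
}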

Combining Lemmas 5.2, 5.3, and 5.4, we have the following theorem:

Theorem 5.1 R_opt can be computed in O(nN^2) time for R(Xmono), and in O(nN^3) time for R(Convex) and R(Rect).

These are the worst-case time complexities. In the next subsection, we further improve the practical time complexity by a factor of O(n / log n).

Guided Branch-and-Bound Search

The hand-probing algorithm computes all vertices on the convex hull. However, we only need to compute the vertex corresponding to R_opt. Hence, we can improve the performance by pruning unnecessary vertices efficiently.

While running the hand-probing algorithm, we maintain the current minimum E_min of the entropy values corresponding to the vertices examined so far. Suppose we have done hand-probing with respect to θ_l and θ_r, and next consider the interval I = [v^+(θ_l), v^+(θ_r)] = [I(left), I(right)] of conv^+(P). Let Q(I) = (x_{Q(I)}, y_{Q(I)}) in Figure 5.4 be the point of intersection of the tangent lines whose slopes are θ_l and θ_r. We compute the value E(Q(I)) = E(x_{Q(I)}, y_{Q(I)}). If the two tangent lines are parallel, we set E(Q(I)) = -∞.

Lemma 5.5 For any point Q' inside the triangle I(left) I(right) Q(I),

E(Q') ≥ min{ E(Q(I)), E_min }.

Proof Immediate from the concavity of E.

This lemma gives a lower bound for the values of E at the vertices between I(left) and I(right) in conv^+(P). Hence, we have the following:

Corollary 5.1 If E(Q(I)) ≥ E_min, no vertex in the interval I of conv^+(P) corresponds to a region whose associated entropy is less than E_min.

On the basis of Corollary 5.1, we can find the optimal region R_opt effectively by running the hand-probing algorithm together with the branch-and-bound strategy guided by the values E(Q(I)). Indeed, the algorithm

examines the subinterval with the minimum value of E(Q(I)) first. Moreover, during the process, subintervals satisfying E(Q(I)) ≥ E_min are pruned away. We maintain the list {E(Q(I)) | I ∈ I} of the unprocessed subintervals, using a priority queue. Note that E_min monotonically decreases, while the minimum of the E(Q(I)) values monotonically increases during the algorithm. Most subintervals are expected to be pruned away during the execution, and the number of touching oracles in the algorithm is expected to be O(log n) in practical instances. We have implemented the algorithm as a subsystem of SONAR, and confirmed the expected performance by the experiments described in Section 5.4. Since the touching oracle needs O(N^2) time for R(Xmono), the algorithm runs experimentally in O(N^2 log n) time, which is O(n log n) because N ≤ √n. Similarly, it runs in O(N^3 log n) time for R(Convex) and R(Rect).

5.4 Performance

In this section, we examine several performance experiments. The algorithms in Section 5.3 have been written in C++, and implemented as a subsystem of our database SONAR (System for Optimized Numeric Association Rules) prototype. First we examine the time taken to compute the optimal region. After verifying its scalability, we then examine the overall time taken to construct a decision tree with region splitting.

Computing Optimal Regions

The performance of a method for constructing decision trees depends on the time taken to compute the optimal regions for splitting the data. Thus the cost of computing one optimal region gives an idea of the method's overall performance in generating one decision tree. In the first experiments, we generated our test data, in the form of an N × N grid, as follows. We first generated random numbers uniformly distributed in [N^2, 2N^2] and assigned them to the number of tuples in each pixel. We then assigned a value in [1, N^2] to the number of tuples that take one of the

Boolean values of the objective attribute, from a corner pixel to the central one, proceeding in a spiral fashion. These test data were generated so that the number of points on the convex hull increases sub-linearly in N, the square root of the number of pixels. We examined the CPU time taken to compute the optimal regions and the number of touching oracles needed to find the regions. All the experiments were performed on an IBM RS/6000 workstation with a 112 MHz PowerPC 604 chip and 512 MB of main memory, running under the AIX 4.1 operating system.

Table 5.2: Performance in computing the optimal region (CPU time in seconds and number of touching oracles, #touch, for the x-monotone, rectilinear convex, and rectangular families at each grid resolution)

Table 5.2 shows the time and the number of touching oracles that were required by the guided branch-and-bound algorithm to find the optimal x-monotone (resp. rectilinear convex, or rectangular) region that minimizes the entropy. It shows that the number of touching oracles increases very slowly, thanks to the guided branch-and-bound algorithm. Figure 5.5 confirms that the CPU time follows our scalability estimate. Although the asymptotic time complexity for computing the optimal x-monotone region is better than that for computing the optimal rectangular region, in practice the optimal rectangular region is computed faster when the grid is not so large, because the constant factor is smaller. At the root of a decision tree, we may need a large number of pixels to guarantee the specified pixel density. The number of tuples, however, decreases in the lower parts of the tree, and the number of pixels soon becomes small for most datasets; hence, computing the optimal rectilinear convex region is not costly, according to Table 5.2.
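To make the #touch column concrete, here is a self-contained sketch (ours, not the SONAR code) of the basic hand-probing recursion of Lemma 5.3 with a counter for the touching-oracle calls; the guided branch-and-bound search of Section 5.3.2 additionally prunes subintervals whose lower bound E(Q(I)) is at least E_min, which is why the observed #touch values grow so slowly. The structure Pt, the functions hand_probe and probe, and the sample stamp points are assumptions made for this illustration.

#include <functional>
#include <iostream>
#include <vector>

struct Pt { double x, y; };   // a stamp point (x(R), y(R)) of a region R

// Hand-probing (Lemma 5.3): enumerate the vertices of the upper chain of
// conv(P) between two known vertices l and r, using a touching oracle
// probe(theta) that returns a stamp point maximizing y - theta * x.
// Every call to probe() is counted as one "touch".
void hand_probe(Pt l, Pt r, const std::function<Pt(double)>& probe,
                std::vector<Pt>& vertices, int& touches) {
    double theta = (r.y - l.y) / (r.x - l.x);   // slope of the chord [l, r]
    ++touches;
    Pt m = probe(theta);
    // If the probe cannot get strictly above the chord, [l, r] is an edge.
    if (m.y - theta * m.x <= l.y - theta * l.x + 1e-9) return;
    vertices.push_back(m);                      // a new vertex of conv(P)
    hand_probe(l, m, probe, vertices, touches);
    hand_probe(m, r, probe, vertices, touches);
}

int main() {
    // Hypothetical stamp points (x(R), y(R)) of a small family of regions.
    std::vector<Pt> P = {{0, 0}, {2, 5}, {4, 8}, {6, 9}, {9, 9}, {10, 5}};
    auto probe = [&P](double theta) {           // brute-force touching oracle
        Pt best = P[0];
        for (const Pt& p : P)
            if (p.y - theta * p.x > best.y - theta * best.x) best = p;
        return best;
    };
    Pt l = P[0], r = P[5];                      // leftmost and rightmost points
    std::vector<Pt> vertices = {l, r};
    int touches = 0;
    hand_probe(l, r, probe, vertices, touches);
    std::cout << vertices.size() << " upper-chain vertices found using "
              << touches << " touching oracles" << std::endl;
    return 0;
}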

Figure 5.5: CPU time for computing the optimized region (time [sec] versus grid resolution, up to 240^2, for the x-monotone, rectangular, and rectilinear families)

Computing Decision Trees

The next experiment examines the overall performance of tree construction. At each node of a decision tree, we first prepare the grid, and then compute the optimal region. The grid preparation is not expensive, because it can be done by scanning all the tuples at the node just once. The problem is that we have to calculate the optimal regions for all pairs (permutations, for x-monotone regions) of two distinct numeric attributes. Thus the number of attributes dramatically affects the overall performance.

Table 5.3 compares the time taken to construct trees by using datasets with different numbers of tuples. We randomly selected tuples from the waveform dataset to generate datasets having different numbers of tuples, and used those datasets to construct decision trees by performing region splitting with a pixel density of 5 and by conventional guillotine cutting. We used the first eight numeric attributes as the conditional attributes, in order to simplify the experiment, and compared the time taken to construct pruned trees. The results show that the tree construction time grows a little faster than our scalability estimate, because trees constructed from larger datasets tend to become bigger. Table 5.4, on the other hand, compares the performance using datasets

with different numbers of attributes. We used 3,000 tuples from the waveform dataset, and constructed trees using the first N numeric attributes. Observe that the time complexity is almost linear in the square of the number of attributes. In the case of guillotine-cut trees, the test optimization cost is so small that the time taken to construct a tree depends mainly on its final size.

Table 5.3: Tree construction time (1) (construction time for x-monotone, rectilinear, and guillotine splitting, for datasets with different numbers of tuples)

Table 5.4: Tree construction time (2) (construction time for x-monotone, rectilinear, and guillotine splitting, for datasets with different numbers of attributes)

5.5 Application to Credit Risk Analysis

We applied our method to credit risk analysis. Table 5.5 shows parts of the financial statements of some Japanese companies, collected over several years from 1992. It contains 69 numeric attributes such as ID (the ID of a company), Net Income / Sales, and Equity Ratio. It also contains the attribute Default, which shows whether the company defaulted within a year after the financial statement was observed. We want to construct a prediction model for predicting the probability of bankruptcy of unseen companies.

Table 5.5: Financial statements of companies (columns: ID, Net Income / Sales (%), Equity Ratio (%), ..., Default (N or D))

In order to discover rules about the value of Default, we collected 1,036 samples of defaulting companies and another 1,036 of non-defaulting companies. Figure 5.6 shows the most important two-dimensional rectilinear convex region rule that was found from the table. Note that there are 69 attributes in the table and more than 2,300 attribute pairs. We examined each pair to find the optimal region, and chose the most important one on the basis of the entropy value. The region on the plane whose x-axis is EBIT / Sales and whose y-axis is Equity Ratio divides the original data S into two subsets, S_1 (inside the region) and S_2 (outside the region), as shown in Table 5.6.

Figure 5.6: Rules for analyzing credit risk (Equity Ratio versus EBIT / Sales for all companies, for the companies inside the region, and for the companies outside the region, distinguishing N and D companies)

Table 5.6: Splitting by region (numbers of companies, defaulting and non-defaulting, in S: all data, S_1: inside the region, and S_2: outside the region)

We divided the data recursively by using two-dimensional rules, and constructed a decision tree to evaluate the credit risk of each company. Figure 5.7 shows the decision tree constructed from this example. In each leaf node (square node), the ratio of defaulting companies is large (or small)

enough to judge the companies in that node. Each company whose credit risk is unknown can be classified into one of the leaf nodes of the tree, and we can estimate its credit risk according to the ratio of defaulting companies in that leaf node. Compared with a conventional decision tree based on one-dimensional binary partition rules, our tree reduces the default prediction error in this instance by nearly 10%. Such a difference has a large impact in this application. Moreover, our method yields about 60% smaller trees than the conventional methods.

5.6 Conclusions

We have presented an entropy-based greedy method of constructing decision trees by using region rules. Although region splitting requires additional computation time proportional to the number of numeric conditional attributes, the improvement will be worth the computational cost in many applications if there are not too many numeric attributes.

Figure 5.7: Decision tree for analyzing credit risk (the root node splits on the region R1 over (EBIT / Sales, Equity Ratio); companies inside R1 are predicted as "N"; the companies outside R1 are split again on the region R2 over (Assets / Liabilities, Equity Ratio), and those outside both R1 and R2 are predicted as "D")
