
INTEGRATING DATA CUBE COMPUTATION AND EMERGING PATTERN MINING FOR MULTIDIMENSIONAL DATA ANALYSIS

by

Wei Lu

A report submitted in partial fulfillment of the requirements for the SFU-ZU dual degree of Bachelor of Science in the School of Computing Science, Simon Fraser University, and the College of Computer Science and Technology, Zhejiang University

© Wei Lu 2010
SIMON FRASER UNIVERSITY AND ZHEJIANG UNIVERSITY
April 2010

All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.

APPROVAL

Name: Wei Lu
Degree: Bachelor of Science
Title of Report: Integrating Data Cube Computation and Emerging Pattern Mining for Multidimensional Data Analysis

Examining Committee:
  Dr. Jian Pei, Associate Professor, Computing Science, Simon Fraser University (Supervisor)
  Dr. Qianping Gu, Professor, Computing Science, Simon Fraser University (Supervisor)
  Dr. Ramesh Krishnamurti, Professor, Computing Science, Simon Fraser University (SFU Examiner)

Date Approved:

Abstract

Online analytical processing (OLAP) in multidimensional text databases has recently become an effective tool for analyzing text-rich data such as web documents. In this capstone project, we follow the trend of using OLAP and the data cube to analyze web documents, but address a new problem from the data mining perspective. In particular, we wish to find contrast patterns in documents of different classes and then use those patterns in OLAP-style analysis of text data and web documents. To this end, we propose to integrate the data cube with an important kind of contrast pattern, the emerging pattern, to build a new data model for solving the document analysis problem. Specifically, this novel data model is implemented on top of the traditional data cube by seamlessly integrating the bottom-up cubing (BUC) algorithm with two different emerging pattern mining algorithms, the Border-Differential and the DPMiner. The processes of cube construction and emerging pattern mining are merged together and carried out simultaneously; patterns are stored in the cube as cell measures. Moreover, we study and compare the performance of the two integrations by conducting experiments on datasets derived from the Frequent Itemset Mining Implementations Repository (FIMI). Finally, we suggest improvements and optimizations for future work.

To my family

For those who believe, no proof is necessary; for those who don't believe, no proof is possible.
Stuart Chase, Writer and Economist, 1888

Acknowledgments

First of all, I would like to express my deepest appreciation to Dr. Jian Pei for his support and guidance during my studies at Simon Fraser University. In the various courses I took with him, and particularly in this capstone project, Dr. Pei showed me his broad knowledge and deep insights in the area of data management and mining, as well as his great personality and patience with a research beginner like me. In his precious time, he provided me with a great deal of help and advice on the project and other concerns (especially my graduate school applications). This work would not have been possible without his supervision.

I would love to thank Dr. Qianping Gu and Dr. Ramesh Krishnamurti for reviewing my report and directing the capstone projects for this amazing dual degree program. My gratitude also goes to Dr. Ze-Nian Li, Dr. Stella Atkins, Dr. Greg Mori and Dr. Ted Kirkpatrick for the wonderful classes I took at SFU and their good advice for my studies and career development. Also thanks to Dr. Guozhu Dong at Wright State University and Dr. Guimei Liu at the National University of Singapore for making useful resources available for my work. I would also like to thank Mr. Thusjanthan Kubendranathan at SFU for his time and help in our discussions about this project.

Deepest gratitude to my family and friends who make my life enjoyable. In particular, I am greatly indebted to my beloved parents for their unconditional support and encouragement. Their love accompanies me wherever I go. This work is dedicated to them, and I hope they are proud of me, as I am always proud of them.

Contents

Approval
Abstract
Dedication
Quotation
Acknowledgments
Contents
List of Tables
List of Figures

1 Introduction
1.1 Overview of Text Data Mining
1.2 Related Work on Multidimensional Text Data Analysis
1.3 Contrast Pattern Based Document Analysis
1.4 Structure of the Report

2 Literature Review
2.1 Data Cubes and Online Analytical Processing
2.1.1 An Example of The Data Cube
2.1.2 Data Cubing Algorithms
2.1.3 BUC: Bottom-Up Computation for Data Cubing
2.2 Frequent Pattern Mining
2.3 Emerging Pattern Mining
2.3.1 The Border-Differential Algorithm
2.3.2 The DPMiner Algorithm
2.4 Summary

3 Motivation
3.1 Motivation for Mining Contrast Patterns
3.2 Motivation for Utilizing The Data Cube
3.3 Summary

4 Our Methodology
4.1 Problem Formulation
4.1.1 Normalizing Data Schema for Text Databases
4.1.2 Problem Modeling with Normalized Data Schema
4.2 Processing Framework
4.2.1 Integrating Data
4.2.2 Integrating Algorithms
4.3 Implementations
4.3.1 BUC with DPMiner
4.3.2 BUC with Border-Differential and PADS
4.4 Summary

5 Experimental Results and Performance Study
5.1 The Test Dataset
5.2 Comparative Performance Study and Analysis
5.2.1 Evaluating the BUC Implementation
5.2.2 Comparing Border-Differential with DPMiner
5.3 Summary

6 Conclusions
6.1 Summary of The Project
6.2 Limitations and Future Work

Bibliography

Index

List of Tables

2.1 A base table storing sales data [15]
2.2 Aggregates computed by group-by Branch
2.3 The full data cube based on Table 2.1
2.4 A sample transaction database [21]
3.1 A multidimensional text database concerning Olympic news
4.1 A normalized dataset derived from the Olympic news database
4.2 A normalized dataset reproduced from Table 4.1
5.1 Sizes of synthetic datasets for experiments
5.2 The complete experimental results

List of Figures

2.1 BUC Algorithm [5, 27]
2.2 An example of an FP-tree based on Table 2.4 [21]
5.1 Running time and cube size of our BUC implementation
5.2 Comparing running time of the two integration algorithms

Chapter 1

Introduction

1.1 Overview of Text Data Mining

Analysis of documents in text databases and on the World Wide Web has been attracting researchers from various areas, such as data mining, machine learning, information retrieval, database systems, and natural language processing. In general, studies in different areas have different emphases. Traditional information retrieval techniques (e.g., the inverted index and the vector-space model) have proven efficient and effective in searching for relevant documents to answer unstructured keyword-based queries. Machine learning approaches are also widely used in text mining, providing effective solutions to various problems. For example, the Naive Bayes model and Support Vector Machines (SVMs) are used in document classification; K-means and the Expectation-Maximization (EM) algorithm are used in document clustering. The textbook by Manning et al. [19] covers the topics summarized above and much more in both traditional information retrieval and machine learning based document analysis.

On the other hand, data warehousing and data mining also play important roles in analyzing documents, especially those stored in a special kind of database called the multidimensional text database (one with both relational dimensions and text fields). While information retrieval mainly addresses searching for documents, and for information within documents, according to users' information needs, the goal of text mining differs in the following sense: it focuses on finding and extracting useful patterns and hidden knowledge from the information in documents and/or text databases, so as to improve decision making based on the text information.

Currently, many real-life business, administration and scientific databases are multidimensional text databases, containing both structured attributes and unstructured text attributes. An example of such a database can be found in Table 3.1. Since data warehousing and online analytical processing (OLAP) have proven their great usefulness in managing and mining multidimensional data of varied granularities [11], they have recently become important tools in analyzing such text databases [6, 17, 24, 26].

1.2 Related Work on Multidimensional Text Data Analysis

A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making [13]. Online analytical processing (OLAP), which is dominated by stylized queries that involve group-by and aggregate operators [27], is a powerful tool in data warehousing. Being a multidimensional data model with various features, the data cube [10] has become an essential OLAP facility in data warehousing. Conceptually, the data cube is an extended database with aggregates at multiple levels and over multiple dimensions [15]. It generalizes the group-by operator by precomputing and storing group-bys with regard to all possible combinations of dimensions. Data cubes are widely used in data warehousing for analyzing multidimensional data.

Applying OLAP techniques, especially data cubes, to analyze documents in multidimensional text databases has made significant advances. Important information retrieval measures, i.e., term frequencies and inverted indexes, have been integrated into the traditional data cube, leading to the text cube [17]. It explores both the dimension hierarchy and the term hierarchy in the text data, and is able to answer OLAP queries by navigating to a specific cell via roll-up and drill-down operations. More recently, the work in [6] proposes a query answering technique called TopCells to address top-k query answering in the text cube. Given a keyword query, TopCells is able to find the top-k ranked cells containing aggregated documents that are most relevant to the query. Another OLAP-based model dealing with multidimensional text data is the topic cube [26]. The topic cube combines OLAP with probabilistic topic modeling; it explores the topic hierarchy of documents and stores probability-based measures learned through a probabilistic model. Moreover, text cubes and topic cubes have been applied to information network analysis, where they are combined into an information-network-enhanced text cube called iNextCube [24].

Most previous works emphasize data warehousing more than data mining. They mainly deal with problems such as how to explore and establish dimension hierarchies within the text data, and how to efficiently answer OLAP queries using cubes built on text data.

1.3 Contrast Pattern Based Document Analysis

We follow the trend of using data cubes to analyze documents in multidimensional text databases. But as the previous works are more data warehousing oriented, we intend to address a more data mining oriented problem, which we call contrast pattern based document analysis. More specifically, we wish to find contrast patterns in documents of different classes and then use those patterns in OLAP-style document analysis (like the work in [6, 17]). This application is promising and has real-life demands. For example, from a large collection of documents containing information and reviews about laptop computers of various brands, a user interested in comparing Dell and Sony laptops might wish to find text information describing Dell's special features that do not characterize Sony. These features contrast the two brands effectively, and would probably make the user's decision to select Dell easier.

To achieve this goal, we propose to integrate frequent pattern mining, especially emerging pattern mining, with data cubing in an efficient and effective way. Frequent pattern mining [2] aims to find itemsets, that is, sets of items that frequently occur in a dataset. Furthermore, patterns that contrast different classes of data must intuitively be frequent in one class but comparatively infrequent in the others. One important class of contrast patterns is the emerging pattern [7], defined as itemsets whose supports increase significantly from dataset D1 to dataset D2. That is, those patterns are frequent in D2 but infrequent in D1. Because of the sharp change of their supports between different datasets, such patterns meet our need of showing contrasts between different classes of web documents.

Our Contributions

To tackle the contrast pattern based document analysis problem, we propose a novel data model that integrates efficient emerging pattern mining algorithms (e.g., the Border-Differential [7]

and the state-of-the-art DPMiner [16]) with the traditional data cube. This integrated model is novel, but also preserves the features of traditional data cubes:

1. It is based on the data cube, and is constructed through a classical data cubing algorithm called BUC (Bottom-Up Computation for data cubing) [5].

2. It contains multidimensional text data and multiple-granularity aggregates of such data, in order to support fast OLAP operations (such as roll-up and drill-down) and query answering.

3. Each cell in the cube contains the set of documents in the multidimensional text database whose dimension attributes match the cell.

4. The measure of each cell is the set of emerging patterns whose supports rise rapidly from the documents not aggregated in the cell to the documents aggregated in the cell.

In this capstone project, we implement this integrated data model by incorporating emerging pattern mining seamlessly into the data cubing process. We choose BUC as our cubing algorithm to build the cube on the structured dimensions. While aggregating documents and materializing cells, we simultaneously mine emerging patterns in the documents aggregated in each particular cell, and store such patterns as the measure of the cell. Two widely used emerging pattern mining algorithms, the Border-Differential and the DPMiner, are integrated with BUC cubing so as to compare their performance. We tested the two integrations on synthetic datasets to evaluate their performance on different sizes of input data. The datasets are derived from the Frequent Itemset Mining Implementations Repository (FIMI) [9]. Experimental results show that the state-of-the-art emerging pattern mining algorithm, the DPMiner, is a better choice than the Border-Differential.

Our cube-based model shares similarities with the text cube [17] and the topic cube [26] at the level of data structure, since all three cubes are built on multidimensional text data. The similarity of the cube-based structure allows the OLAP query answering techniques developed in [6, 17, 24, 26] to be directly applied to our cube. In that sense, point queries (seeking a cell), sub-cube queries (seeking an entire group-by) and top-k queries (seeking the k most relevant cells) can be answered in contrast pattern based document analysis using our model.

Major Differences from Existing Works

This cube-based data model with emerging patterns as cell measures differs from all previous related work. It is unlike traditional data cubes, which use simple aggregate functions as cell measures and are only adequate for relational databases. Our approach also differs from the text cube, which uses term frequencies and inverted indexes as cell measures, and from the topic cube, which uses probabilistic measures. Most importantly, to the best of our knowledge, our data model is novel in comparison to previous applications of emerging patterns in OLAP. Specifically, a previous work [20] used the Border-Differential algorithm to perform cube comparisons and capture trend changes between two precomputed data cubes. However, that work is of limited use and cannot be applied to multidimensional text data analysis. First, their approach worked on datasets different in kind from ours: the previous method only works on traditional data cubes built upon relational databases with categorical dimension attributes, while ours is designed for multidimensional text databases. Second, their approach finds cells whose supports grow significantly from one cube to another, while ours is able to determine emerging patterns for every single cell in the cube. Last but not least, their approach performs the Border-Differential algorithm after the two data cubes have been completely built, whereas our approach introduces a seamless integration: data cubing and emerging pattern mining are carried out simultaneously.

1.4 Structure of the Report

The rest of this capstone project report is organized as follows: Chapter 2 conducts a literature review of previous work and background knowledge that lays the foundation for this project. Chapter 3 motivates the contrast pattern based document analysis problem. Chapter 4 describes our methodology for tackling the problem; this chapter formulates the problem and proposes algorithms for constructing the integrated data model. Chapter 5 reports experimental results and studies the performance of our algorithms. Lastly, Chapter 6 concludes this capstone project and suggests improvements and optimizations for future work.

Chapter 2

Literature Review

This chapter reviews three categories of previous research that are related to this capstone project: data cubes and OLAP, frequent pattern mining, and emerging pattern mining. In Section 2.1 we discuss the fundamentals of data warehousing, online analytical processing (OLAP), and data cubing. We highlight BUC [5], a bottom-up approach to data cubing. Section 2.2 introduces frequent pattern mining and an important mining algorithm called FP-Growth [12]. Section 2.3 reviews the emerging pattern mining algorithms (Border-Differential [7] and DPMiner [16]) that are particularly useful to our work.

2.1 Data Cubes and Online Analytical Processing

A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data in support of management's decision-making process [13]. A powerful tool for exploiting data warehouses is so-called online analytical processing (OLAP). Typically, OLAP systems are dominated by stylized queries involving many group-by and aggregation operations [27].

The data cube was introduced in [10] to facilitate answering OLAP queries on multidimensional data stored in data warehouses. A data cube can be viewed as an extended multi-level and multidimensional database with aggregates at multiple granularities [15]. The term data cubing refers to the process of constructing a data cube based on a relational database table, often referred to as the base table. In a cubing process, cells with non-empty aggregates are materialized. Given a base table, we precompute group-bys and the corresponding aggregate values with respect to all possible combinations of dimensions in the table. Each group-by corresponds to a set of cells, and the aggregate value computed for each cell is stored as its measure. Cell measures provide a good and concise summary of the information aggregated in the cube.

In light of the above, the data cube is a powerful data model allowing fast retrieval and analysis of multidimensional data for decision making processes based on data warehouses. It generalizes the group-by operator in SQL (Structured Query Language), and enables data analysts to avoid long and complicated SQL queries when searching for unusual data patterns in multidimensional databases [10].

2.1.1 An Example of The Data Cube

Example (Data Cube): Table 2.1 is a sample base table in a marketing management data warehouse [15]. It shows data organized under the schema (Branch, Product, Season, Sales).

Branch  Product  Season  Sales
B1      P1       spring  6
B1      P2       spring  12
B2      P1       fall    9

Table 2.1: A base table storing sales data [15].

To build a data cube upon this table, group-bys are computed on the three dimensions Branch, Product and Season, and aggregate values of Sales become the cell measures. We choose Average(Sales) as the aggregate function for this example. Since most intermediate steps of a data cubing process are essentially computing group-bys and aggregate values to form cells, we illustrate the two cells computed by group-by Branch in Table 2.2.

Cell No.  Branch  Product  Season  AVG(Sales)
1         B1      *        *       9
2         B2      *        *       9

Table 2.2: Aggregates computed by group-by Branch.

In the same manner, the full data cube contains all possible group-bys on Branch, Product and Season. It is shown in Table 2.3.

Note that cells 1, 2 and 3 are derived from the least aggregated group-by, group-by (Branch, Product, Season); such cells are called base cells. On the other hand, cell 18, (*, *, *), is the apex cuboid aggregating all tuples in the base table.

Cell No.  Branch  Product  Season  AVG(Sales)
1         B1      P1       spring  6
2         B1      P2       spring  12
3         B2      P1       fall    9
4         B1      P1       *       6
5         B1      P2       *       12
6         B1      *        spring  9
7         B2      P1       *       9
8         B2      *        fall    9
9         *       P1       spring  6
10        *       P1       fall    9
11        *       P2       spring  12
12        *       *        spring  9
13        *       *        fall    9
14        B1      *        *       9
15        B2      *        *       9
16        *       P1       *       7.5
17        *       P2       *       12
18        *       *        *       9

Table 2.3: The full data cube based on Table 2.1.
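To make the cubing step concrete, the short Python sketch below (our illustration, not code from [5] or [15]; the names base_table and full_cube are ours) computes every group-by of Table 2.1 with AVG(Sales) by brute force, reproducing Table 2.3:

from itertools import combinations
from collections import defaultdict

# Base table from Table 2.1: (Branch, Product, Season, Sales).
base_table = [
    ("B1", "P1", "spring", 6),
    ("B1", "P2", "spring", 12),
    ("B2", "P1", "fall", 9),
]

def full_cube(rows, num_dims=3):
    """AVG(Sales) for every group-by; '*' marks an aggregated dimension."""
    cube = {}
    for k in range(num_dims, -1, -1):            # base cells first, apex last
        for kept in combinations(range(num_dims), k):
            groups = defaultdict(list)
            for row in rows:
                key = tuple(row[d] if d in kept else "*"
                            for d in range(num_dims))
                groups[key].append(row[-1])      # collect Sales values
            for key, sales in groups.items():
                cube[key] = sum(sales) / len(sales)
    return cube

cube = full_cube(base_table)
print(cube[("B1", "*", "*")], cube[("*", "*", "*")])   # 9.0 9.0

A real cubing algorithm avoids rescanning the whole table for every group-by; the optimization principles summarized next, and BUC in particular, address exactly that cost.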

2.1.2 Data Cubing Algorithms

Efficient and scalable data cubing is challenging. When a base table has a large number of dimensions and each dimension has a high cardinality, time and space complexity grow exponentially. In general, there are three approaches to cubing, in terms of the order in which cells are materialized: top-down, bottom-up, and a mix of both. A top-down approach (e.g., Multiway Array Aggregation [28]) constructs the cube from the least aggregated base cells towards the most aggregated apex cuboid. On the contrary, a bottom-up approach such as BUC [5] computes cells in the opposite order. Other methods, such as Star-Cubing [23], combine the top-down and bottom-up mechanisms to carry out the cubing process.

On fast computation of multidimensional aggregates, [11] summarizes the following optimization principles: (1) sorting or hashing dimension attributes to cluster related tuples that are likely to be aggregated together in certain group-bys; (2) computing higher-level aggregates from previously computed lower-level aggregates, and caching intermediate results in memory to reduce expensive I/O operations; (3) computing a group-by from the smallest previously computed group-by; and (4) mapping dimension attributes in various formats to integers ranging between zero and the cardinality of the dimension. There are also many other heuristics proposed to improve the efficiency of data cubing [1, 5, 11].

2.1.3 BUC: Bottom-Up Computation for Data Cubing

BUC [5] constructs the data cube bottom-up, from the most aggregated apex cuboid to group-bys on a single dimension, then on a pair of dimensions, and so on. It also uses many of the optimization techniques introduced in the previous section. Figure 2.1 illustrates the processing tree and the partition method used by BUC on a 4-dimensional base table. Subfigure (b) shows the recursive nature of BUC: after sorting and partitioning the data on dimension A, we deal with the partition (a1, *, *, *) first and recursively partition it on dimension B to proceed to its parent cell (a1, b1, *, *), then the ancestor (a1, b1, c1, *), and so on. After dealing with partition a1, BUC continues to process partitions a2, a3 and a4 in the same manner until all cells are materialized.

Figure 2.1: BUC Algorithm [5, 27].

The depth-first search process for building our integrated data model (covered in Chapter 4) follows the basic framework of BUC.
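As a companion to Figure 2.1, here is a minimal Python sketch of BUC's recursive partition-and-descend order (our simplification of [5]: the measure is just a tuple count, and pruning details such as writeAncestors are omitted):

def buc(rows, dims, cell, out):
    """Recursively materialize cells in BUC order.

    rows: list of tuples (one value per dimension);
    dims: indices of dimensions not yet grouped on;
    cell: the partially filled cell, '*' for aggregated dimensions."""
    out[tuple(cell)] = len(rows)                 # measure: COUNT(*) of the partition
    for pos, d in enumerate(dims):
        partitions = {}
        for row in rows:                         # partition rows on dimension d
            partitions.setdefault(row[d], []).append(row)
        for value, part in sorted(partitions.items()):
            cell[d] = value                      # descend into (.., value, ..)
            buc(part, dims[pos + 1:], cell, out)
            cell[d] = "*"                        # backtrack

rows = [("a1", "b1", "c1"), ("a1", "b2", "c1"), ("a2", "b1", "c2")]
cube = {}
buc(rows, list(range(3)), ["*"] * 3, cube)
print(cube[("a1", "*", "*")])   # 2 tuples fall into partition a1

Each recursive call materializes the current cell first (the apex cell on the first call) and then partitions the remaining tuples on each later dimension, matching BUC's bottom-up order from the apex cuboid toward the base cells.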

2.2 Frequent Pattern Mining

Frequent patterns are patterns (sets of items, sequences, etc.) that occur frequently in a database [2]. The supports of frequent patterns must exceed a pre-defined minimal support threshold. Frequent pattern mining has been studied extensively in the past two decades. It lays the foundation for many data mining tasks such as association rule mining [3] and emerging pattern mining. Although its definition is concise, the mining algorithms are not trivial. Two notable algorithms are Apriori [3] and FP-Growth [12]. FP-Growth is the more important to our work, as efficient emerging pattern mining algorithms such as [4, 16] use the FP-tree proposed in FP-Growth as their data structure.

FP-Growth addresses the limitations of the breadth-first-search-based Apriori, such as multiple database scans and large amounts of candidate generation and support counting. It is a depth-first search algorithm. The first scan of the database finds all frequent items, ranks them in frequency-descending order, and puts them into a header table. FP-Growth then compresses the database into a prefix tree called the FP-tree. The complete set of frequent patterns can be mined by recursively constructing projected databases and their FP-trees. For example, given the transaction database in Table 2.4 [21], we can build an FP-tree accordingly (shown in Figure 2.2).

TID  Items                   (Ordered) Frequent Items
100  f, a, c, d, g, i, m, p  f, c, a, m, p
200  a, b, c, f, l, m, o     f, c, a, b, m
300  b, f, h, j, o           f, b
400  b, c, k, s, p           c, b, p
500  a, f, c, e, l, p, m, n  f, c, a, m, p

Table 2.4: A sample transaction database [21].
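To illustrate the first scan and tree construction on Table 2.4, here is a small Python sketch (ours; FPNode and build_fp_tree are hypothetical names, and the header table with node links is omitted):

from collections import Counter

class FPNode:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    # First scan: count items and rank the frequent ones by descending frequency.
    freq = Counter(i for t in transactions for i in t)
    rank = {i: r for r, (i, c) in
            enumerate(freq.most_common()) if c >= min_support}
    root = FPNode(None)
    for t in transactions:
        # Keep frequent items only, reordered in frequency-descending order.
        ordered = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in ordered:                      # insert as a prefix path
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
tree = build_fp_tree(db, min_support=3)
print(sorted(tree.children))   # two top-level prefixes, 'c' and 'f', as in Figure 2.2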

Figure 2.2: An example of an FP-tree based on Table 2.4 [21].

Next, we define three special types of frequent patterns, the maximal frequent patterns (max-patterns for short), the closed frequent patterns, and the frequent generators, as they are closely related to emerging pattern mining.

Definition (Max-Pattern): An itemset X is a maximal frequent pattern, or max-pattern, in dataset D if X is frequent in D and, for every proper super-itemset Y such that X ⊂ Y, Y is infrequent in D [11].

Definition (Closed Pattern and Generator): An itemset X is closed in dataset D if there exists no proper super-itemset Y such that X ⊂ Y and support(X) = support(Y) in D. X is a closed frequent pattern in D if it is both closed and frequent in D [11]. An itemset Z is a generator in D if there exists no proper sub-itemset Z' such that Z' ⊂ Z and support(Z') = support(Z) [18].

The state-of-the-art max-pattern mining algorithm is the Pattern-Aware Dynamic Search (PADS) [25]. The DPMiner, the state-of-the-art emerging pattern mining algorithm, is also the most powerful algorithm for mining closed frequent patterns and frequent generators.

2.3 Emerging Pattern Mining

Emerging patterns [7] are patterns whose supports increase significantly from one class of data to another. Mathematical details can be found in Section 4.1 (Problem Formulation) of this report and in [4, 7, 8, 16]. The original work on emerging patterns [7] gives an algorithm called the Border-Differential for mining such patterns. It uses borders to succinctly represent patterns and mines the patterns by manipulating the borders only. The work in [4]

used the FP-tree introduced in [12] for emerging pattern mining. Following that, the work in [16] improves the FP-tree-based algorithm by simultaneously generating closed frequent patterns and frequent generators to form emerging patterns. This algorithm is called the DPMiner and is considered the state of the art for emerging pattern mining.

2.3.1 The Border-Differential Algorithm

Border-Differential uses borders to represent patterns. It involves mining max-patterns and manipulating the borders initiated by those patterns to derive the border representation of emerging patterns. A border is an ordered pair <L, R>, where L and R are the left and right bounds of the border, respectively. Both L and R are collections of itemsets, but are much smaller than the represented pattern collection in size. The patterns represented by <L, R> are the intervals of <L, R>, defined as [L, R] = {Y : ∃X ∈ L, ∃Z ∈ R, such that X ⊆ Y ⊆ Z}. For example, suppose [L, R] = {{1}, {1, 2}, {1, 3}, {1, 2, 3}, {2, 3}, {2, 3, 4}}; it has the border L = {{1}, {2, 3}}, R = {{1, 2, 3}, {2, 3, 4}}. Itemsets other than those in L and R (e.g., {1, 3}) are intervals of <L, R>.

Given a pair of borders <{∅}, R1> and <{∅}, R2> whose left bounds are initially empty, the differential border <L1, R1> is derived to satisfy [L1, R1] = [{∅}, R1] − [{∅}, R2]. This operation is the so-called Border-Differential. Furthermore, given two datasets D1 and D2, to determine emerging patterns using the Border-Differential operation, we first determine the max-patterns U1 of D1 and U2 of D2 using PADS, and initiate two borders <{∅}, U1> and <{∅}, U2>. Then we take the differential between those two borders. Let U1 = {X1, X2, ..., Xn} and U2 = {Y1, Y2, ..., Ym}, where the Xi and Yj are itemsets; the left bound of the differential border is computed as L1 = ∪_{i=1..n} (PowerSet(Xi) − ∪_{j=1..m} PowerSet(Yj)). The right bound U1 remains the same. Lastly, we form the border <L1, U1>, and the intervals [L1, U1] of <L1, U1> are the emerging patterns in D1.

As the size of the datasets grows, the Border-Differential becomes problematic because it involves set enumerations, resulting in exponential computational costs. The work in [8], a more recent version of [7], proposed several optimization techniques to improve the efficiency of the Border-Differential. In fact, however, the complexity of finding emerging patterns is MAX SNP-hard, which means that polynomial time approximation schemes do not exist unless P = NP [22].
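As a toy illustration of the left-bound computation above (ours, not the optimized routines of [7, 8]), the following Python sketch enumerates power sets directly and then keeps only the minimal itemsets of the difference, in keeping with a left bound being a small collection; it is feasible only for small max-patterns:

from itertools import combinations

def power_set(itemset):
    """All non-empty subsets of an itemset, as frozensets."""
    items = sorted(itemset)
    return {frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)}

def border_differential(U1, U2):
    """Left bound L1 of the differential border <L1, U1>.

    U1, U2: max-patterns of D1 and D2 (right bounds of the input borders).
    Returns the minimal itemsets covered by <{}, U1> but not by <{}, U2>."""
    covered2 = set().union(*(power_set(Y) for Y in U2)) if U2 else set()
    diff = set().union(*(power_set(X) for X in U1)) - covered2
    # Keep only minimal itemsets: those with no proper subset left in diff.
    return {S for S in diff if not any(T < S for T in diff)}

U1 = [{1, 2, 3}, {2, 3, 4}]
U2 = [{1, 2}, {3, 4}]
print(sorted(map(sorted, border_differential(U1, U2))))
# -> [[1, 3], [2, 3], [2, 4]]: minimal itemsets under U1 not covered by U2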

2.3.2 The DPMiner Algorithm

The work in [4] used the FP-tree and pattern-growth methods to mine emerging patterns, but it still needs to call the Border-Differential to find them. The DPMiner (for Discriminative Pattern Miner) in [16] also uses the FP-tree but mines emerging patterns in a different way. It finds closed frequent patterns and frequent generators simultaneously to form equivalence classes of such patterns, and then determines emerging patterns as non-redundant δ-discriminative equivalence classes [16].

An equivalence class EC is a set of itemsets that always occur together in some transactions of dataset D [16]. It can be uniquely represented by its set of frequent generators G and its closed frequent pattern C, in the form EC = [G, C]. Suppose D can be divided into several classes, denoted D = D1 ∪ D2 ∪ ... ∪ Dn. Let δ be a small integer (usually 1 or 2) and θ be a minimal support threshold. An equivalence class EC is a δ-discriminative equivalence class provided that the support of its closed pattern C is greater than θ in D1 but smaller than δ in D − D1 = D2 ∪ ... ∪ Dn. Furthermore, EC is a non-redundant δ-discriminative equivalence class if and only if (1) it is δ-discriminative, and (2) there exists no δ-discriminative equivalence class ÊC such that Ĉ ⊂ C, where Ĉ and C are the closed patterns of ÊC and EC, respectively. The closed frequent patterns of the non-redundant δ-discriminative equivalence classes are emerging patterns in D1.

Data Structures and Computational Steps of The DPMiner

The high efficiency of the DPMiner is mainly attributed to its revised FP-tree structure. Unlike traditional FP-trees, it does not store items that appear in every transaction and hence have full support in D. These items are removed because they cannot form generators. This modification results in a much smaller FP-tree than the original.

The computational framework of the DPMiner consists of the following five steps:

(1) Given k classes of data D1, D2, ..., Dk as input, take their union D = D1 ∪ D2 ∪ ... ∪ Dk. Also specify a minimal support threshold θ and a maximal threshold δ (thus, patterns with supports above θ in Di but below δ in D − Di are candidate emerging patterns in Di).

(2) Construct an FP-tree based on D and run a depth-first search on the tree to find frequent generators and closed patterns simultaneously. For each search path along the tree, the search terminates whenever a δ-discriminative equivalence class is reached.

(3) Determine the class label distribution for every closed pattern, i.e., find in which class a closed pattern has the highest support. This step is necessary because patterns are not mined separately for each Di (1 ≤ i ≤ k), but rather on the entire D.

(4) Pair up generators and closed frequent patterns to form δ-discriminative equivalence classes.

(5) Output the non-redundant δ-discriminative equivalence classes as emerging patterns. If a pattern is labeled i (1 ≤ i ≤ k), then it is an emerging pattern in Di.

2.4 Summary

In this chapter, we discussed previous research addressing data cubing, frequent pattern mining and emerging pattern mining, all of which are essential for our project. The algorithms most closely related to our work (Bottom-Up Cubing, the Border-Differential and the DPMiner) have been described in detail.

Chapter 3

Motivation

In this chapter, we motivate the problem of contrast pattern based document analysis. We explain why contrast patterns (in particular, emerging patterns) are useful, and why data cubes should be used in analyzing documents in multidimensional text databases.

3.1 Motivation for Mining Contrast Patterns

This section answers the following two questions: (1) Why do we need to mine and use contrast patterns to analyze web documents? (2) How useful are those patterns; in other words, can they make a significant contribution to a good text mining application? We answer these questions by introducing motivating scenarios from real life.

Example (Contrast Patterns in Documents): Before 2010, Canada had not hosted the Olympic Games for 22 years, since the Calgary 1988 Olympic Winter Games. People may therefore want to know the most attractive and discriminative features of the Vancouver 2010 Winter Olympics compared to all previous Olympic Games. Indeed, there are exciting and touching stories in almost all Olympics, and Vancouver certainly has its unique moments. For example, the Canadian figure skater Joannie Rochette won a bronze medal under the keenly felt pain of losing her mother a day before her event started. Suppose a user searches the web and Google returns her a collection of documents on the Olympics, consisting of many online sports news articles and commentaries. There may be too much information for her to read through to find the unique stories about Vancouver 2010. Although there is no doubt that Joannie Rochette's accomplishment occurs frequently in articles related to Vancouver 2010, a user previously unaware of Rochette may

not be able to learn about her quickly from the search results. Similar situations may also arise when users compare products online by searching for and reading reviews from previous buyers. Here is the example we saw in Section 1.3: suppose a user is comparing Dell's laptop computers with Sony's. She probably wants to know the special features of Dell laptops that Sony's do not have. For example, many reviewers might speak in favor of Dell by commenting on its high performance-price ratio, but would not do so for Sony if that were not the case. Then "high performance-price ratio" is a pattern contrasting Dell laptops with Sony laptops.

Letting users manually determine such contrast patterns is not feasible. Therefore, given a collection of documents, ideally pre-classified and stored in a multidimensional text database, we need to develop efficient data models and corresponding algorithms to determine contrast patterns in documents of different classes. As mentioned in Section 1.3, we choose the emerging pattern [7], since it is a representative class of contrast patterns widely used in data mining, and there are good algorithms [4, 7, 16] available for mining such patterns efficiently.

Moreover, emerging patterns can contribute to other problems in text mining. A novel document classifier could be constructed based on those patterns, as they have been shown useful in building accurate classifiers [8]. Also, since emerging patterns are able to capture the discriminative features of a class of data, they may be helpful in extracting keywords to summarize a given text.

3.2 Motivation for Utilizing The Data Cube

In many real-life database applications, documents and the text data within them are stored in multidimensional text databases [24]. These databases are distinct from the traditional data sources we deal with, such as relational databases, transaction databases, and text corpora. Formally, a multidimensional text database is a relational database with text fields. A sample text database is shown in Table 3.1. The first three dimensions (Event, Time, and Publisher) are standard dimensions, just like those in relational databases. The last column is the text dimension, containing documents made up of text terms.

Text databases provide structured attributes for documents, and the information needs of users vary and can be modeled hierarchically. This makes OLAP and data cubes applicable. For instance (using Table 3.1), if a user wants to read news on the ice hockey games reported by the Vancouver Sun on February 20, 2010, then the two documents d1

Event           Time       Publisher       ...  Text Data: Documents
Ice hockey      2010/2/20  Vancouver Sun   ...  d1 = {t1, t2, t3, t4}
Ice hockey      2010/2/23  Globe and Mail  ...  d2 = {t2, t3, t7, t8}
Ice hockey      2010/2/20  Vancouver Sun   ...  d3 = {t1, t2, t3, t6}
Figure skating  2010/2/20  Globe and Mail  ...  d4 = {t2, t4, t6, t7}
Figure skating  2010/2/20  Vancouver Sun   ...  d5 = {t1, t3, t5, t7}
Curling         2010/2/23  New York Times  ...  d6 = {t2, t5, t7, t9}
Curling         2010/2/28  Globe and Mail  ...  d7 = {t3, t6, t8, t9}
...             ...        ...             ...  ...

Table 3.1: A multidimensional text database concerning Olympic news.

and d3, matching the query {Event = Ice hockey, Time = 2010/2/20, Publisher = Vancouver Sun}, will be returned to her. If another user wants to skim all Olympic news reported by the Vancouver Sun on that day, we roll up to the query {Event = *, Time = 2010/2/20, Publisher = Vancouver Sun} and return documents d1, d3 and d5 to her. The opposite of roll-up is called drill-down; in fact, roll-up and drill-down are two OLAP operations of great importance [11]. Therefore, to meet different levels of information needs, it is natural for us to apply the data cube to model and extend such a text database. This is exactly what the previous work in [17, 24, 26] did.

3.3 Summary

In light of the above, this chapter shows that contrast patterns are useful in analyzing large-scale text data and are able to give concise information about the data. Also, the nature of multidimensional text databases makes OLAP, and the most essential OLAP tool, the data cube, particularly suitable for modeling and analyzing the text data in documents.

Chapter 4

Our Methodology

In this chapter, we describe our methodology for tackling contrast pattern based document analysis: building a novel integrated data model through BUC data cubing [5] and two emerging pattern mining algorithms, the Border-Differential [7] and the DPMiner [16]. Section 4.1 formulates the problem we address in this work. Section 4.2 describes the processing framework and our algorithms, at both the data integration level and the algorithm integration level. Section 4.3 discusses issues related to implementation.

4.1 Problem Formulation

4.1.1 Normalizing Data Schema for Text Databases

Suppose a collection of web documents is stored in a multidimensional text database. The text data in the documents is collected under a schema containing a set of standard non-text dimensions {SD1, SD2, ..., SDn} and a set of text dimensions (terms) {TD1, TD2, ..., TDm}, where m is the number of distinct text terms in the collection. For simplicity, text terms can be mapped to items, so documents can be mapped to transactions, or itemsets (sets of items that appear together). This mapping is similar to the bag-of-words model, which represents text data as an unordered collection of words, disregarding word order and count. In that sense, a multidimensional text database can be mapped to a relational base table combined with a transaction database.

Under the above mapping mechanism, each tuple in a text database corresponds to a certain document, in the form <S, T>, where S is the set of standard dimension attributes

and T is a transaction. The dimension attributes can be learned through a classifier or labeled manually. Words in the document are tokenized, and each distinct token is treated as an item in the transaction. For example, the tuple corresponding to the first row in Table 3.1 is <Ice hockey, 2010/2/20, Vancouver Sun, ..., d1 = {t1, t2, t3, t4}>, with d1 = {t1, t2, t3, t4} being the transaction.

Furthermore, we normalize text database tuples to derive a simplified data schema. We map standard dimensions to letters, e.g., Event to A, Time to B and Publisher to C, to make them uniform. Likewise, dimension attributes are mapped to items in the same manner: Ice hockey is mapped to a1, Figure skating to a2, and so on. Table 4.1 shows a normalized dataset derived from the Olympic news database (Table 3.1).

A    B    C    ...  Transactions
a1   b1   c1   ...  d1 = {t1, t2, t3, t4}
a1   b2   c2   ...  d2 = {t2, t3, t7, t8}
a1   b1   c1   ...  d3 = {t1, t2, t3, t6}
a2   b1   c2   ...  d4 = {t2, t4, t6, t7}
a2   b1   c1   ...  d5 = {t1, t3, t5, t7}
a3   b2   c3   ...  d6 = {t2, t5, t7, t9}
a3   b3   c2   ...  d7 = {t3, t6, t8, t9}
...  ...  ...  ...  ...

Table 4.1: A normalized dataset derived from the Olympic news database.
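The following Python sketch (our illustration; normalize and the toy whitespace tokenizer are hypothetical simplifications) shows this normalization step: each dimension value is interned to a compact integer id, and each document becomes a transaction of distinct tokens:

def normalize(rows):
    """Turn (dimension values, document text) rows into <S, T> tuples
    with integer-coded attributes and token-set transactions."""
    dim_maps = {}   # one value -> int dictionary per dimension index
    dataset = []
    for dims, text in rows:
        coded = []
        for d, value in enumerate(dims):
            m = dim_maps.setdefault(d, {})
            coded.append(m.setdefault(value, len(m)))   # 0 .. cardinality-1
        # bag-of-words: distinct lowercase tokens, order and count disregarded
        transaction = frozenset(text.lower().split())
        dataset.append((tuple(coded), transaction))
    return dataset

rows = [
    (("Ice hockey", "2010/2/20", "Vancouver Sun"), "t1 t2 t3 t4"),
    (("Figure skating", "2010/2/20", "Globe and Mail"), "t2 t4 t6 t7"),
]
for s, t in normalize(rows):
    print(s, sorted(t))   # e.g. (0, 0, 0) ['t1', 't2', 't3', 't4']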

4.1.2 Problem Modeling with Normalized Data Schema

Given a normalized dataset as a base table, we build our integrated cube-based data model by computing a full data cube grouped by all standard dimensions (e.g., {A, B, C} in the table above). In the data cubing process, every subset of {A, B, C} is gone through to form a group-by corresponding to a set of cells. Emerging patterns in each cell are mined simultaneously and stored as cell measures.

When materializing each cell, we aggregate the tuples whose dimension attributes match the cell. The transactions of the matched tuples form the target class (or positive class), denoted TC. We also virtually aggregate all unmatched tuples and extract their transactions to form the background class (or negative class), denoted BC. The membership of TC and BC varies from cell to cell; both classes are dynamically computed for each cell.

A transaction T is the full itemset in a tuple. A pattern X is a sub-itemset of T with non-zero support (i.e., the number of times X appears) in the given dataset. Let θ be the minimal support threshold for TC and δ be the maximal support threshold for BC. Pattern X is an emerging pattern in TC if and only if support(X, TC) ≥ θ and support(X, BC) ≤ δ. In other words, the support of X grows significantly from BC to TC, exceeding a minimal growth rate threshold ρ = θ/δ. Mathematically, growth_rate(X) = support(X, TC) / support(X, BC) ≥ ρ. Note that δ can be 0, in which case ρ = θ/δ = ∞; if growth_rate(X) = ∞, X is a jumping emerging pattern [7], one which does not appear in BC at all.

Given predefined support thresholds θ and δ, for each cell in this cube-based model we mine all patterns whose support is above θ in the target class TC and below δ in the background class BC. Such patterns automatically exceed the minimal growth rate threshold ρ, and become the measure of the cell. Upon obtaining all cells and their corresponding emerging patterns, the model building process is complete. The entire process is based on data cubing, and requires a seamless integration of cubing and emerging pattern mining.

Example: Consider a simple example regarding the base table in Table 4.1. Let θ = 2 and δ = 1. Suppose at a certain stage we are carrying out the group-by operation on dimension A. We get three cells: (a1, *, *), (a2, *, *) and (a3, *, *). For cell (a1, *, *), which aggregates the first three tuples in Table 4.1, TC = {d1, d2, d3} and BC = {d4, d5, d6, d7}. Now consider the pattern X = {t1, t2, t3}. It appears twice in TC (in d1 and d3) but zero times in BC, so support(X, TC) ≥ θ and support(X, BC) ≤ δ. Hence X = {t1, t2, t3} is a (jumping) emerging pattern in TC, and is a measure of cell (a1, *, *).
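This per-cell test can be stated directly in code. The brute-force Python sketch below (ours; a real implementation mines the patterns with the Border-Differential or the DPMiner rather than testing a given candidate) reproduces the numbers of the example above:

def support(pattern, transactions):
    """Number of transactions containing every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)

def is_emerging(pattern, tc, bc, theta, delta):
    """Emerging pattern test for one cell: frequent in the target class
    TC and (nearly) absent from the background class BC."""
    return support(pattern, tc) >= theta and support(pattern, bc) <= delta

# Cell (a1, *, *) from the example: TC = {d1, d2, d3}, BC = {d4, ..., d7}.
tc = [{"t1", "t2", "t3", "t4"}, {"t2", "t3", "t7", "t8"},
      {"t1", "t2", "t3", "t6"}]
bc = [{"t2", "t4", "t6", "t7"}, {"t1", "t3", "t5", "t7"},
      {"t2", "t5", "t7", "t9"}, {"t3", "t6", "t8", "t9"}]
print(is_emerging({"t1", "t2", "t3"}, tc, bc, theta=2, delta=1))   # True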

4.2 Processing Framework

To recapitulate, Chapter 1 introduced contrast pattern based document analysis in multidimensional text databases. We follow the idea of using data cubes and OLAP to analyze multidimensional text data, and propose to merge the BUC data cubing process with two different emerging pattern mining algorithms (the Border-Differential and the DPMiner) to build an integrated data model based on the data cube. This model is designed to support contrast pattern based document analysis. In this section, following the problem formulation in Section 4.1, we propose our algorithm for integrating emerging pattern mining into data cubing. The entire processing framework comprises both data integration and algorithm integration.

4.2.1 Integrating Data

To begin with, we reproduce Table 4.1 (with slight revisions) as Table 4.2 to make the following discussion clear. It shows the standard, idealized format of data that simplifies a multidimensional text database. The data used in our testing strictly follows this format: each row in a dataset D is a tuple of the form <S, T>, where S is the set of dimension attributes and T is a transaction.

Tuple No.  A    B    C    F    Transactions
1          a1   b1   c1   f1   d1 = {t1, t2, t3, t4}
2          a1   b2   c2   f1   d2 = {t2, t3, t7, t8}
3          a1   b1   c2   f2   d3 = {t1, t2, t3, t6}
4          a2   b1   c2   f2   d4 = {t2, t4, t6, t7}
5          a2   b1   c1   f1   d5 = {t1, t3, t5, t7}
6          a3   b2   c3   f3   d6 = {t2, t5, t7, t9}
7          a3   b3   c2   f3   d7 = {t3, t6, t8, t9}
8          a4   b2   c3   f1   d8 = {t6, t8, t11, t12}

Table 4.2: A normalized dataset reproduced from Table 4.1.

The integration of data is indispensable because of the nature of the multidimensional text mining problem. In addition, data cubing and emerging pattern mining algorithms originally work with data from heterogeneous sources: data cubing mainly deals with relational base tables in data warehouses, while emerging pattern mining concerns transaction databases (see the example in Table 2.4). Therefore, we should unify the heterogeneous data first and then develop algorithms for a seamless integration. Thus, we model the text database and its normalized schema (Table 4.2) by appending transaction database tuples to relational base table tuples.

Moreover, for the integrated data, we also apply one of the optimization techniques discussed in Section 2.1.2: mapping all dimension attributes, whatever their original formats, to integers between zero and the cardinality of the attribute [11]. For example, in Table 4.2, dimension A has cardinality |A| = 4, so in implementation and testing we map a1 to 0, a2 to 1, a3 to 2 and a4 to 3. Similarly, items in transactions are mapped to integers ranging from one to the total number of items in the dataset. For instance, if all items in a dataset are labeled from t1 to t100, we can represent them by the integers 1 to 100. This kind of mapping facilitates sorting and hashing in data cubing. For BUC in particular, it allows the use of the linear counting sort algorithm to reorder input tuples efficiently.
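A sketch of such a counting sort (ours; the bookkeeping in [5] differs): because attribute values are integers in [0, cardinality), the tuples of a partition can be reordered on one dimension in linear time.

def counting_sort_by_dim(rows, d, cardinality):
    """Reorder rows by the integer value of dimension d in O(n + C) time."""
    counts = [0] * cardinality
    for row in rows:                 # histogram of attribute values
        counts[row[d]] += 1
    starts, total = [], 0
    for c in counts:                 # prefix sums give each partition's offset
        starts.append(total)
        total += c
    out = [None] * len(rows)
    for row in rows:                 # scatter each tuple into its partition
        out[starts[row[d]]] = row
        starts[row[d]] += 1
    return out

rows = [(2, 5), (0, 1), (1, 9), (0, 7)]
print(counting_sort_by_dim(rows, 0, 3))   # [(0, 1), (0, 7), (1, 9), (2, 5)]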

4.2.2 Integrating Algorithms

Our algorithm integrates data cubing and emerging pattern mining seamlessly. It carries out a depth-first search (DFS) to build the data cube and mine emerging patterns as cell measures simultaneously. The algorithm is designed to work on any valid integrated dataset like Table 4.2 (both the dimension attributes and the transaction should be non-empty in every tuple). We outline the algorithm in the following pseudo-code (adapted from [5]).

Algorithm: Procedure BottomUpCubeWithDPMiner(data, dim, theta, delta)

Inputs:
  data: the dataset upon which we build our integrated model.
  dim: the number of standard dimensions in the input data.
  theta: the minimal support threshold of candidate emerging patterns in the target class.
  delta: the maximal support threshold of candidate emerging patterns in the background class.

Outputs: cells with their measures (patterns)

Method:
 1: aggregate(data);
 2: if (data.count == 1) then
 3:   writeAncestors(data, dim);
 4:   return;
 5: endif
 6: for each dimension d (from 0 to (dim - 1)) do
 7:   C := cardinality(d);
 8:   newData := partition(data, d); // counting sort.
 9:   for each partition i (from 0 to (C - 1)) do
10:     cell := createEmptyCell();
11:     posData := newData.gatherPositiveTransactions();
12:     negData := newData.gatherNegativeTransactions();
13:     isDuplicate := determineCoverage(posData, negData);
14:     if (!isDuplicate) then